Abstract

In convex optimization, there is an acceleration phenomenon in which we can boost the 
convergence rate of certain gradient-based algorithms. We can observe this phenomenon in 
Nesterov’s accelerated gradient descent, accelerated mirror descent, and accelerated cubic- 
regularized Newton’s method, among others. In this paper, we show that the family of higher- 
order gradient methods in discrete time (generalizing gradient descent) corresponds to a family 
of first-order rescaled gradient flows in continuous time. On the other hand, the family of accelerated
higher-order gradient methods (generalizing accelerated mirror descent) corresponds to
a family of second-order differential equations in continuous time, each of which is the Euler- 
Lagrange equation of a family of Lagrangian functionals. We also study the exponential variant 
of the Nesterov Lagrangian, which corresponds to a generalization of Nesterov’s restart scheme 
and achieves a linear rate of convergence in discrete time. Finally, we show that the family of 
Lagrangians is closed under time dilation (an orbit under the action of speeding up time), which 
demonstrates the universality of this Lagrangian view of acceleration in optimization. 


1 Introduction 

In convex optimization, many discrete-time algorithms can be interpreted as discretizing a continuous-time
curve converging to the optimal solution $f^*$ of the optimization problem:

$$\min_x f(x).$$

For example, the classical gradient descent algorithm in discrete time ($k \in \{0, 1, 2, \dots\}$)

$$x_{k+1} = x_k - \epsilon \nabla f(x_k) \qquad (1.1)$$

can be viewed as the algorithm obtained by applying the forward-Euler method to discretize
gradient flow ($t \ge 0$)

$$\dot{X}_t = -\nabla f(X_t). \qquad (1.2)$$
Many methods, including (1.1) and (1.2) above, can be interpreted as minimizing a regularized
approximation of the objective function $f$. Indeed, gradient descent can be written as:

$$x_{k+1} = x_k + v_k$$

$$v_k = \arg\min_v \left\{ f(x_k) + \langle \nabla f(x_k), v \rangle + \frac{1}{2\epsilon} \|v\|^2 \right\} \qquad (1.3)$$

whereas gradient flow can be written as:

$$\dot{X}_t = \arg\min_v \left\{ f(X_t) + \langle \nabla f(X_t), v \rangle + \frac{1}{2} \|v\|^2 \right\}. \qquad (1.4)$$

Moreover, (1.3) and (1.4) have matching convergence rates; gradient descent has convergence rate

$$f(x_k) - f^* \le O\left(\frac{1}{\epsilon k}\right) \qquad (1.5)$$

when $f$ is smooth (has $\frac{1}{\epsilon}$-Lipschitz gradients) and convex, where the $O(\frac{1}{\epsilon k})$ term in the bound
above is with respect to $k \to \infty$, with $\epsilon$ fixed (more precisely, $f(x_k) - f^* \le \frac{C}{\epsilon k}$ for all sufficiently
large $k$, for some constant $C > 0$). Similarly, gradient flow has convergence rate

$$f(X_t) - f^* \le O\left(\frac{1}{t}\right) \qquad (1.6)$$

when $f$ is convex, without requiring smoothness, and the $O(\frac{1}{t})$ term above is with respect to
$t \to \infty$. Note that the forward-Euler method discretizes the curve using the identification $x_k = X_t$,
$x_{k+1} = X_{t+\delta} \approx X_t + \delta \dot{X}_t = x_k + v_k$, with the time-step $\delta$ set equal to the step size $\epsilon$ of the discrete
time algorithm (equivalently, with time scaling $t = \epsilon k$).
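
As a minimal numerical sketch of the update (1.1) and the rate (1.5), the following Python snippet runs gradient descent on a small quadratic and records the gap $f(x_k) - f^*$. It can be compared against the classical bound $\|x_0 - x^*\|^2 / (2 \epsilon k)$, which holds for convex $f$ with $\frac{1}{\epsilon}$-Lipschitz gradient. The quadratic objective, the starting point, and all helper names are illustrative assumptions, not taken from the paper.

```python
def gradient_descent(grad, x0, eps, num_steps):
    """Iterate x_{k+1} = x_k - eps * grad(x_k), as in (1.1); return all iterates."""
    xs = [list(x0)]
    x = list(x0)
    for _ in range(num_steps):
        g = grad(x)
        x = [xi - eps * gi for xi, gi in zip(x, g)]
        xs.append(list(x))
    return xs


# Illustrative objective (an assumption): f(x) = 0.5 * (x_1^2 + 10 * x_2^2),
# whose gradient is 10-Lipschitz, so eps = 1/10 matches the smoothness condition.
CURV = (1.0, 10.0)


def f(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(CURV, x))


def grad_f(x):
    return [a * xi for a, xi in zip(CURV, x)]


EPS = 0.1
X0 = [1.0, 1.0]
iterates = gradient_descent(grad_f, X0, EPS, 100)
gaps = [f(x) for x in iterates]  # f^* = 0 is attained at x^* = 0
```

Here `gaps[k]` plays the role of $f(x_k) - f^*$; plotting it against $1/(\epsilon k)$ is one way to see the sublinear rate.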


1.1 Summary of Results 

The link between discrete-time algorithms and continuous-time curves, and their matching properties
(i.e. convergence rates) extends far beyond gradient descent (1.1) and gradient flow (1.2). We
begin (Section 2) by studying higher-order gradient algorithms $\mathcal{G}_p$ ($p \ge 2$):

$$x_{k+1} = x_k + v_k$$

$$v_k = \arg\min_v \left\{ f_{p-1}(v; x_k) + \frac{1}{\epsilon} \cdot \frac{1}{p} \|v\|^p \right\} \qquad (1.7)$$

where $f_{p-1}(v; x)$ is the $(p-1)$-st Taylor approximation of $f(x + v)$ centered at $x$:

$$f_{p-1}(v; x_k) = \sum_{i=0}^{p-1} \frac{1}{i!} \nabla^i f(x_k) v^i = f(x_k) + \langle \nabla f(x_k), v \rangle + \cdots + \frac{1}{(p-1)!} \nabla^{p-1} f(x_k)[v, \dots, v]. \qquad (1.8)$$

The $p$-th order gradient method $\mathcal{G}_p$, with the ansatz $x_k = X_t$, $x_{k+1} = X_{t+\delta} \approx X_t + \delta \dot{X}_t = x_k + v_k$,
and time-step $\delta = \epsilon^{\frac{1}{p-1}}$ (equivalently, with time scaling $t = \epsilon^{\frac{1}{p-1}} k$), discretizes a $p$-th order rescaled
gradient flow:

$$\dot{X}_t = \arg\min_v \left\{ f(X_t) + \langle \nabla f(X_t), v \rangle + \frac{1}{p} \|v\|^p \right\}. \qquad (1.9)$$
Note (1.9) can also be written as:

$$\dot{X}_t = -\frac{\nabla f(X_t)}{\|\nabla f(X_t)\|_*^{\frac{p-2}{p-1}}}. \qquad (1.10)$$

Furthermore, (1.7) and (1.9) have matching convergence rates; the $p$-th order gradient method has
convergence rate:

$$f(x_k) - f^* \le O\left(\frac{1}{\epsilon k^{p-1}}\right)$$

when $f$ is $\frac{(p-1)!}{\epsilon}$-smooth of order $p-1$ and $f$ is convex, and the rescaled gradient flow has convergence
rate:

$$f(X_t) - f^* \le O\left(\frac{1}{t^{p-1}}\right)$$

when $f$ is convex.

In Section 3, we present an algorithm which generalizes accelerated gradient descent [7] and the
accelerated Newton method [9], and accelerates the family of higher-order gradient algorithms (1.7).
Building on the work of Su, Boyd, and Candès [13] (for the $p = 2$ Euclidean case), in Section 3.3
we show that the $p$-th order accelerated gradient method discretizes a second-order differential
equation we call Nesterov flow:

$$\ddot{X}_t + \frac{p+1}{t} \dot{X}_t + C p^2 t^{p-2} \left[ \nabla^2 h\left( X_t + \frac{t}{p} \dot{X}_t \right) \right]^{-1} \nabla f(X_t) = 0$$

under the time step $\delta = \epsilon^{\frac{1}{p}}$ (or time scaling $t^p = \epsilon k^p$). Moreover, the $p$-th order accelerated gradient
algorithm and its corresponding Nesterov flow have matching convergence rates; the $p$-th order
accelerated gradient method has convergence rate:

$$f(y_k) - f^* \le O\left(\frac{1}{\epsilon k^p}\right)$$

when $f$ is $\frac{(p-1)!}{\epsilon}$-smooth of order $p-1$ and $f$ is convex, and the corresponding Nesterov flow has
convergence rate (Section 4.1):

$$f(X_t) - f^* \le O\left(\frac{1}{t^p}\right)$$

when $f$ is convex. Note that the family of Nesterov flows are second-order differential equations in
time and the rescaled gradient flows are first-order differential equations in time.

In Section 5, we show that the Nesterov flows are a subfamily of the Bregman flows:

$$\ddot{X}_t + \dot{\gamma}_t \dot{X}_t + e^{\beta_t} \left[ \nabla^2 h\left( X_t + e^{-\alpha_t} \dot{X}_t \right) \right]^{-1} \nabla f(X_t) = 0$$

where $\alpha_t = -\log t + \log p$, $\beta_t = (p-2) \log t + 2 \log p + \log C$, and $\gamma_t = (p+1) \log t - \log p$. Under
an ideal scaling relationship between $\alpha_t, \beta_t, \gamma_t$ (satisfied by the Nesterov flows), each Bregman flow
satisfies the Euler-Lagrange equation of a Bregman Lagrangian functional:

$$\mathcal{L}(x, v, t) = e^{\gamma_t} \left( e^{2\alpha_t} D_h(x + e^{-\alpha_t} v, x) - e^{\beta_t} f(x) \right). \qquad (1.11)$$
Therefore, the family of Nesterov flows (4.1) can be interpreted as optimal curves under the principle
of least action, which posits that curves evolve so as to minimize a quantity known as an action,
defined as the time integral of a Lagrangian functional $\mathcal{L}(X_t, \dot{X}_t, t)$.

In Section 6.1, we introduce exp-Nesterov flows, another subfamily of the Bregman flows that
satisfies the ideal scaling (where $\alpha_t = \log c$, $\beta_t = ct + 2 \log c$, $\gamma_t = ct - \log c$). The exp-Nesterov
flows have an improved convergence rate:

$$f(X_t) - f^* \le O\left(e^{-ct}\right).$$

In Section 6.2, we show how to discretize the exp-Nesterov flow, and with the additional assumption 
of uniform convexity, obtain a discrete-time algorithm with matching linear rate. The algorithm 
presented generalizes the restart scheme of Nesterov [9], and is optimal (attains the lower bound [7, 
Section 2.2.1]) when / is both smooth and strongly convex (i.e. p = 2). 

Finally, in Section 7 we demonstrate how time can be used as an organizing tool to understand
the various algorithms presented in this paper. Indeed, in continuous time optimization, if we
start with a curve $X_t$ with convergence rate $f(X_t) - f^* \le O(e^{-\rho_t})$, we can simply consider the
sped-up curve $Y_t = X_{\tau(t)}$, where $\tau: \mathbb{R}_+ \to \mathbb{R}_+$ is a monotonically increasing function. This new
curve $Y_t$ has faster convergence rate $f(Y_t) - f^* \le O(e^{-\rho_{\tau(t)}})$, where $\rho_{\tau(t)} \ge \rho_t$. In this paper, we
explore groups of time dilation functions $\tau$ and their corresponding group action on the space of
curves. We show that the family of Bregman Lagrangian functionals forms an orbit under the
group action of time dilation; moreover, the family of Nesterov Lagrangian functionals (Section 4)
and the family of exp-Nesterov Lagrangian functionals (Section 6) form isomorphic sub-orbits. We
can therefore interpret the curves corresponding to the family of accelerated methods as the result
of speeding up (or traversing faster) the single curve corresponding to accelerated gradient descent.
The cost for translating these faster curves into discrete-time algorithms (in addition to significant
computational costs) is increasingly restrictive smoothness assumptions on the function.

We summarize in Figure 1 the relations between the objects we study in this paper. We see a 
consistent, almost parallel structure between continuous time (top layer) and discrete time (bottom 
layer). As the key conceptual message of this paper, we find there is a big difference in the nature 
of first-order equations (such as gradient flow) and second-order equations (such as accelerated 
gradient flow), due to the connection to the Lagrangian framework. Working with second-order 
equations provides better results in both continuous and discrete time. 

1.2 Notation and Preliminaries 

We formalize our setting and recall some basic definitions. Our objective is to minimize a convex
objective function $f: \mathcal{X} \to \mathbb{R}$, which means the graph of $f$ lies above any tangent hyperplane:

$$f(y) \ge f(x) + \langle \nabla f(x), y - x \rangle \qquad (1.12)$$

or equivalently, any intermediate value is at most the average value (Jensen's inequality):

$$f(\lambda x + (1 - \lambda) y) \le \lambda f(x) + (1 - \lambda) f(y)$$

for all $x, y \in \mathcal{X}$ and $0 \le \lambda \le 1$. We assume $f$ is smooth and continuously differentiable as many
times as necessary.

The domain $\mathcal{X}$ is an open convex subset of a vector space, say $\mathcal{X} \subseteq \mathbb{R}^d$ (we take $\mathcal{X} = \mathbb{R}^d$
for simplicity). In particular, we take the point of view that $\mathcal{X}$ is a manifold: any
point $x \in \mathcal{X}$ has a tangent space $T_x\mathcal{X}$ (the vector space of instantaneous displacements $v$ such that
$x + \epsilon v \in \mathcal{X}$ for small $\epsilon > 0$). Since $\mathcal{X} \subseteq \mathbb{R}^d$, we can identify $T_x\mathcal{X}$ with $\mathcal{X}$ (or $\mathbb{R}^d$) itself. But now we
can have an interesting mixing of timescales between the points $x \in \mathcal{X}$ and the vectors $v \in T_x\mathcal{X}$,
which are now "promoted" from $\epsilon$ to the standard timescale. Indeed, we will see in Section 4
that Nesterov's acceleration technique involves the choice of mixing $x$ and $v$ using $\epsilon = \frac{t}{p}$, which
is counterintuitive because it is increasing, and yields sublinear rate of convergence. On the other
hand, the exponential variant of Nesterov uses $\epsilon = \frac{1}{c}$, which is more reasonable, and yields linear
rate of convergence.

The cotangent space $T_x^*\mathcal{X}$ is the dual vector space to $T_x\mathcal{X}$ (the space of linear functionals $\phi$
on $T_x\mathcal{X}$). For example, the gradient $\nabla f(x) \in T_x^*\mathcal{X}$ is a covector, which defines the directional
derivative of $f$:

$$\langle \nabla f(x), v \rangle = f'(x; v) := \lim_{\epsilon \to 0} \frac{f(x + \epsilon v) - f(x)}{\epsilon}. \qquad (1.13)$$

Because we use the $\ell_2$-norm, we can identify $T_x\mathcal{X} \cong T_x^*\mathcal{X}$ by the identity map (which we implicitly
use in (1.1)); but note that conceptually, they are different spaces. Similarly, the dual norm $\|\cdot\|_*$
is also the same $\ell_2$-norm, but we will maintain the distinction of $\|\cdot\|_*$.

We say $f$ is $L$-smooth of order $l \ge 0$ if $\nabla^l f$ is $L$-Lipschitz (and is continuous):

$$\|\nabla^l f(x) - \nabla^l f(y)\|_* \le L \|x - y\|. \qquad (1.14)$$

The case $l = 0$ means $f$ is Lipschitz, and $l = 1$ is the usual smoothness definition ($\nabla f$ is Lipschitz).
We say $h$ is $\sigma$-uniformly convex of order $p \ge 2$ ($p = 2$ is strong convexity) if:

$$D_h(y, x) \ge \frac{\sigma}{p} \|y - x\|^p \qquad (1.15)$$

where $D_h(y, x) = h(y) - h(x) - \langle \nabla h(x), y - x \rangle \ge 0$ is the Bregman divergence induced by a strictly
convex distance generating function $h: \mathcal{X} \to \mathbb{R}$.

1.3 Gradient algorithms 

We review gradient-based algorithms that correspond to first-order equations in continuous time. 


1.3.1 Mirror descent and natural gradient flow

The mirror descent algorithm

$$x_{k+1} = \arg\min_x \left\{ \langle \nabla f(x_k), x \rangle + \frac{1}{\epsilon} D_h(x, x_k) \right\} \qquad (1.16)$$

which can be written as

$$x_{k+1} = x_k + v_k$$

$$v_k = \arg\min_v \left\{ f(x_k) + \langle \nabla f(x_k), v \rangle + \frac{1}{\epsilon} D_h(x_k + v, x_k) \right\}$$

measures displacement by the Bregman divergence. Note that gradient descent (1.3) takes $h(x) =
\frac{1}{2}\|x\|^2$, $\mathcal{X} = \mathbb{R}^d$, and the classical multiplicative weight update uses $h(x) = \sum_j x_j \log x_j$, where $\mathcal{X}$ is the
probability simplex. In general, there are often suitable choices of $h$ that confer some computational
gain in practice (e.g., milder dimension dependence). Similar to (1.5), mirror descent has
convergence rate

$$f(x_k) - f^* \le O\left(\frac{1}{\epsilon k}\right) \qquad (1.17)$$

when $\nabla f$ is $\frac{1}{\epsilon}$-Lipschitz, and $h$ is strongly convex (i.e., uniformly convex of order 2).
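
To make the multiplicative weight special case concrete, here is a small Python sketch of one mirror descent step (1.16) on the probability simplex, assuming the distance generating function is the negative entropy $h(x) = \sum_j x_j \log x_j$. Under that assumption the minimizer has the closed form $x_{k+1,j} \propto x_{k,j} \exp(-\epsilon \, \partial_j f(x_k))$. The linear objective and the function names are hypothetical choices made only for illustration.

```python
import math


def entropic_mirror_descent_step(x, g, eps):
    """One step of (1.16) with h(x) = sum_j x_j log x_j on the simplex.

    x is the current point (positive entries summing to 1), g the gradient at x.
    """
    w = [xj * math.exp(-eps * gj) for xj, gj in zip(x, g)]
    total = sum(w)
    return [wj / total for wj in w]


# Illustrative linear objective f(x) = <COSTS, x> (an assumption).
COSTS = [1.0, 2.0, 3.0]
point = [1.0 / 3.0] * 3
values = []
for _ in range(100):
    values.append(sum(c * xj for c, xj in zip(COSTS, point)))
    point = entropic_mirror_descent_step(point, COSTS, 0.5)
```

For this linear objective the weight concentrates on the coordinate with the smallest cost, which is the familiar behavior of the multiplicative weight update.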

In continuous time, mirror descent corresponds to (with $\delta = \epsilon$, $t = \epsilon k$) natural gradient flow:

$$\dot{X}_t = -\nabla^2 h(X_t)^{-1} \nabla f(X_t) \qquad (1.18)$$

which can be cast as the solution to the optimization problem

$$\dot{X}_t = \arg\min_v \left\{ f(X_t) + \langle \nabla f(X_t), v \rangle + \frac{1}{2} \|v\|^2_{\nabla^2 h(X_t)} \right\}. \qquad (1.19)$$

Natural gradient flow is a steepest descent direction when the metric in $\mathcal{X}$ is induced by the Hessian
$\nabla^2 h$. Thus, mirror descent (1.16) can be seen as an alternative discretization of the natural gradient
flow, another being Amari's [2] natural gradient descent^1:

$$x_{k+1} = x_k - \epsilon \nabla^2 h(x_k)^{-1} \nabla f(x_k). \qquad (1.20)$$

Similar to (1.19), we can interpret (1.20) as the solution to the optimization problem

$$x_{k+1} = x_k + v_k$$

$$v_k = \arg\min_v \left\{ f(x_k) + \langle \nabla f(x_k), v \rangle + \frac{1}{2\epsilon} \|v\|^2_{\nabla^2 h(x_k)} \right\}$$

where $\|v\|^2_{\nabla^2 h(x)} = \langle \nabla^2 h(x) v, v \rangle$ is the Hessian norm induced by $h$. Furthermore, like gradient
flow (1.6), natural gradient flow has convergence rate

$$f(X_t) - f^* \le O\left(\frac{1}{t}\right). \qquad (1.21)$$

^1 This equivalence has also been observed by [12]; see Appendix A for further discussion.
1.3.2 Newton's method and Newton's flow

Newton's method optimizes a quadratic approximation of the objective function $f$:

$$x_{k+1} = \arg\min_x \left\{ f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + \frac{1}{2\epsilon} \langle \nabla^2 f(x_k)(x - x_k), x - x_k \rangle \right\} \qquad (1.22)$$

which can be written explicitly as

$$x_{k+1} = x_k - \epsilon \nabla^2 f(x_k)^{-1} \nabla f(x_k). \qquad (1.23)$$

The original Newton's method corresponds to $\epsilon = 1$, but there have been many proposed choices
of step size $\epsilon$ to improve stability and ensure convergence (e.g., see [10] and references therein).

We can also interpret (1.23) as natural gradient descent (1.20) when $h = f$. Thus, in continuous
time it corresponds to Newton's flow:^2

$$\dot{X}_t = -\nabla^2 f(X_t)^{-1} \nabla f(X_t) \qquad (1.24)$$

which is natural gradient flow (1.18) with $f = h$. However, convergence results for the scheme (1.23)
are difficult to obtain and have all been local. Only in special cases (e.g., self-concordance, a local
Lipschitz condition on $\nabla^2 f$) are we able to have any strong guarantee on Newton's method.


1.3.3 Cubic-regularized Newton’s method and rescaled gradient flow 


To address this issue, Nesterov and Polyak [10] proposed cubic-regularized Newton's method, which
optimizes a second-order approximation of $f$ plus regularization:

$$x_{k+1} = \arg\min_x \left\{ f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + \frac{1}{2} \langle \nabla^2 f(x_k)(x - x_k), x - x_k \rangle + \frac{1}{3\epsilon} \|x - x_k\|^3 \right\}. \qquad (1.25)$$

They showed [10, Theorem 4] that if $\nabla^2 f$ is $\frac{2}{\epsilon}$-Lipschitz, then (1.25) has convergence rate guarantee

$$f(x_k) - f^* \le \frac{27 \|x_0 - x^*\|^3}{\epsilon (k+3)^2} = O\left(\frac{1}{\epsilon k^2}\right). \qquad (1.26)$$

As mentioned in Section 1.1, we show (Section 2.2) that in continuous time (with $\delta = \epsilon^{\frac{1}{2}}$, $t^2 = \epsilon k^2$),
(1.25) corresponds to the rescaled gradient flow:

$$\dot{X}_t = -\frac{\nabla f(X_t)}{\|\nabla f(X_t)\|_*^{\frac{1}{2}}} \qquad (1.27)$$

and that (1.27) has matching convergence rate

$$f(X_t) - f^* \le O\left(\frac{1}{t^2}\right). \qquad (1.28)$$

Finally, we note that by adding a regularization term in Newton's method (1.25), Nesterov
and Polyak changed the problem from discretizing a Newton's flow (1.24) to rescaled gradient
flow (1.27). Thus, the two variants (1.23), (1.25) differ quite a bit in continuous time.
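
The cubic-regularized step can be illustrated in one dimension, where the subproblem has a closed form. The sketch below assumes (1.25) reads $x_{k+1} = \arg\min_x \{ f(x_k) + f'(x_k) s + \frac{1}{2} f''(x_k) s^2 + \frac{1}{3\epsilon} |s|^3 \}$ with $s = x - x_k$, so that the step solves $f'(x_k) + f''(x_k) s + \frac{1}{\epsilon} |s| s = 0$. The test function $f(x) = \log\cosh(x)$ and the choice $\epsilon = 2$ are illustrative assumptions; its third derivative is bounded by $1$ in absolute value, so $f''$ is $\frac{2}{\epsilon}$-Lipschitz for this $\epsilon$.

```python
import math


def cubic_newton_step(g, h, eps):
    """Minimizer s of g*s + h*s**2/2 + |s|**3/(3*eps) in one dimension (h >= 0)."""
    if g == 0.0:
        return 0.0
    root = math.sqrt((eps * h) ** 2 + 4.0 * eps * abs(g))
    # |s| is the positive root of s^2/eps + h*s - |g| = 0, written in a
    # cancellation-free form.
    magnitude = 2.0 * eps * abs(g) / (eps * h + root)
    return -math.copysign(magnitude, g)


def f(x):
    return math.log(math.cosh(x))


def run_cubic_newton(x0, eps, num_steps):
    xs = [x0]
    x = x0
    for _ in range(num_steps):
        g = math.tanh(x)              # f'(x)
        h = 1.0 - math.tanh(x) ** 2   # f''(x)
        x = x + cubic_newton_step(g, h, eps)
        xs.append(x)
    return xs


trajectory = run_cubic_newton(3.0, 2.0, 20)
```

Far from the minimizer the cubic term dominates and the step length scales like $\sqrt{\epsilon |f'(x_k)|}$; near the minimizer the Hessian term dominates and the step approaches a Newton step.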


^2 Note, Newton's flow is explicitly solvable: $X_t = \nabla f^*(e^{-t} \nabla f(X_0))$, where $f^*$ is the convex dual of $f$.





1.4 Accelerated gradient algorithms 


We review accelerated gradient algorithms that correspond to second-order equations in continuous 
time. 


1.4.1 Accelerated gradient descent 

Nesterov's accelerated gradient descent [6, 7] improves the performance of gradient descent (1.1)
from $O(1/\epsilon k)$ to the optimal $O(1/\epsilon k^2)$. This gain is achieved not by strengthening the assumption
on $f$, but by incorporating the displacement $x_k - x_{k-1}$ to shift the point at which we query the
gradient $\nabla f$ (thus, this method is often referred to as gradient descent "with momentum"). The
algorithm [7, (2.2.6)] maintains three sequences $\{x_k\}, \{y_k\}, \{z_k\}$ and proceeds as follows. For any
$y_0 = z_0 \in \mathcal{X}$, $k \ge 0$:

$$x_k = \tau_k z_k + (1 - \tau_k) y_k, \qquad (1.29a)$$

$$y_{k+1} = x_k - \epsilon \nabla f(x_k), \qquad (1.29b)$$

$$z_{k+1} = z_k - \frac{\epsilon}{\tau_k} \nabla f(x_k). \qquad (1.29c)$$

Here $\epsilon > 0$ is the step size, and $\tau_k \in (0, 1)$ is defined recursively by $\tau_{-1} = 1$ and the rule, for $k \ge 0$:

$$\frac{\tau_k^2}{1 - \tau_k} = \tau_{k-1}^2. \qquad (1.30)$$

We can also see that $\tau_k = 2/k + o(1/k)$, for indeed with $\tau_k = \frac{2}{k+2}$ we have $\frac{\tau_k^2}{1 - \tau_k} = \frac{4}{k(k+2)} \approx \frac{4}{(k+1)^2} = \tau_{k-1}^2$.
Nesterov showed [7, Theorem 2.2.2] that when $\nabla f$ is $\frac{1}{\epsilon}$-Lipschitz, then (1.29) satisfies:

$$f(y_k) - f^* \le \frac{4 \|x_0 - x^*\|^2}{\epsilon (k+2)^2} = O\left(\frac{1}{\epsilon k^2}\right) \qquad (1.31)$$

which improves the $O(1/\epsilon k)$ rate (1.5) of gradient descent, and matches the lower bound.
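
The three-sequence scheme can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a reference implementation: it assumes the updates are $x_k = \tau_k z_k + (1 - \tau_k) y_k$, $y_{k+1} = x_k - \epsilon \nabla f(x_k)$, $z_{k+1} = z_k - \frac{\epsilon}{\tau_k} \nabla f(x_k)$, and that the recursion (1.30) reads $\tau_k^2 / (1 - \tau_k) = \tau_{k-1}^2$ with $\tau_{-1} = 1$, whose positive root is $\tau_k = 2 / (1 + \sqrt{1 + 4/\tau_{k-1}^2})$. The quadratic test objective and all names are hypothetical.

```python
import math


def accelerated_gradient_descent(grad, x0, eps, num_steps):
    """Run the scheme (1.29) from y_0 = z_0 = x0; return the y-iterates."""
    y = list(x0)
    z = list(x0)
    tau_prev = 1.0  # tau_{-1} = 1
    ys = [list(y)]
    for _ in range(num_steps):
        # Positive root of tau^2 / (1 - tau) = tau_prev^2.
        tau = 2.0 / (1.0 + math.sqrt(1.0 + 4.0 / tau_prev ** 2))
        x = [tau * zi + (1.0 - tau) * yi for zi, yi in zip(z, y)]
        g = grad(x)
        y = [xi - eps * gi for xi, gi in zip(x, g)]
        z = [zi - (eps / tau) * gi for zi, gi in zip(z, g)]
        ys.append(list(y))
        tau_prev = tau
    return ys


# Illustrative objective (an assumption): f(x) = 0.5 * (x_1^2 + 10 * x_2^2).
CURV = (1.0, 10.0)


def f(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(CURV, x))


def grad_f(x):
    return [a * xi for a, xi in zip(CURV, x)]


EPS = 0.1
X0 = [1.0, 1.0]
ys = accelerated_gradient_descent(grad_f, X0, EPS, 200)
```

The list `ys` holds $y_0, y_1, \dots$, so `f(ys[k])` can be compared directly with the right-hand side of (1.31).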

As pointed out by Su, Boyd and Candès [13], in continuous time accelerated gradient descent
corresponds to the second order equation:

$$\ddot{X}_t + \frac{3}{t} \dot{X}_t + \nabla f(X_t) = 0 \qquad (1.32)$$

with time scaling $\delta = \sqrt{\epsilon}$, $t^2 = \epsilon k^2$, and matching convergence rate [13, Theorem 3.2]:

$$f(X_t) - f^* \le O\left(\frac{1}{t^2}\right). \qquad (1.33)$$


1.4.2 Accelerated mirror descent 

In [8], Nesterov proposed accelerated mirror descent, which proceeds much like the Euclidean
case (1.29), except we replace the z-update (1.29c) by a (weighted) mirror descent step (1.16).
The algorithm [8, (3.11)] maintains three sequences $\{x_k\}, \{y_k\}, \{z_k\}$ and proceeds as follows. For
any $x_0 \in \mathcal{X}$, $k \ge 0$:

$$y_k = \arg\min_y \left\{ f(x_k) + \langle \nabla f(x_k), y - x_k \rangle + \frac{1}{2\epsilon} \|y - x_k\|^2 \right\} \qquad (1.34a)$$

$$z_k = \arg\min_z \left\{ \sum_{i=0}^{k} \frac{i+1}{2} \left[ f(x_i) + \langle \nabla f(x_i), z - x_i \rangle \right] + \frac{1}{\epsilon} \cdot \frac{1}{\sigma} D_h(z, x_0) \right\} \qquad (1.34b)$$

$$x_{k+1} = \frac{2}{k+3} z_k + \frac{k+1}{k+3} y_k.$$

Under the same assumption as mirror descent ($\nabla f$ is $\frac{1}{\epsilon}$-Lipschitz and $h$ is $\sigma$-strongly convex), the
accelerated version (1.34) has improved (optimal) convergence [8, Theorem 2]:

$$f(y_k) - f^* \le \frac{4 D_h(x^*, x_0)}{\epsilon \sigma (k+1)(k+2)} = O\left(\frac{1}{\epsilon k^2}\right). \qquad (1.35)$$

Note, using the equivalence of mirror descent as the cascaded version of dual averaging, we can
write the update (1.34b) above recursively using the standard mirror descent algorithm:^3

$$z_k = \arg\min_z \left\{ \frac{k+1}{2} \langle \nabla f(x_k), z \rangle + \frac{1}{\epsilon} \cdot \frac{1}{\sigma} D_h(z, z_{k-1}) \right\}. \qquad (1.36)$$

In the Euclidean case (when $h = \frac{1}{2}\|\cdot\|^2$ and setting $\sigma = 1$), the update (1.36) above simplifies to
the explicit rule $z_k = z_{k-1} - \frac{\epsilon (k+1)}{2} \nabla f(x_k)$, recovering accelerated gradient descent (1.29).

Similar to (1.32), accelerated mirror descent corresponds to the second order equation:

$$\ddot{X}_t + \frac{3}{t} \dot{X}_t + \left[ \nabla^2 h\left( X_t + \frac{t}{2} \dot{X}_t \right) \right]^{-1} \nabla f(X_t) = 0 \qquad (1.37)$$

with time scaling $\delta = \sqrt{\epsilon}$, $t^2 = \epsilon k^2$, and matching convergence rate:

$$f(X_t) - f^* \le O\left(\frac{1}{t^2}\right). \qquad (1.38)$$


1.4.3 Accelerated cubic-regularized Newton’s method 


In [9], Nesterov proposed the accelerated cubic-regularized Newton's method, which proceeds as (1.29),
except we replace the y-update (1.29b) by cubic-regularized Newton's method (1.25). The algorithm
[9, (4.8)] maintains three sequences $\{x_k\}, \{y_k\}, \{z_k\}$ and proceeds as follows. For any $x_0 \in \mathcal{X}$,
$k \ge 0$:
yk = argmm|/(xfc) {Vf{xk),y- Xk) + f{xk){y - Xk),y - Xk) + ^\\y - XfcU^j (1.39a) 

, (1.39b) 

(1.39c) 


Zk = argmmi + {Vf{yi),z- y,)] X^\\z- Xq 

3 k 

^k+i — T~T~^Zk + , , ^yk 


k + 3 


k + 3^ 


^3 This observation has also been reported by [1].

Under the same assumption as the cubic-regularized Newton's method ($\nabla^2 f$ is $\frac{2}{\epsilon}$-Lipschitz), the
accelerated algorithm (1.39) has convergence rate [9, Theorem 2]:

$$f(y_k) - f^* \le \frac{14 \|x_0 - x^*\|^3}{\epsilon k (k+1)(k+2)} = O\left(\frac{1}{\epsilon k^3}\right). \qquad (1.40)$$

As noted in Section 1.1, we show (Section 3.3) the accelerated algorithm (1.39) corresponds to
the following differential equation:

$$\ddot{X}_t + \frac{4}{t} \dot{X}_t + 9t \left[ \nabla^2 h\left( X_t + \frac{t}{3} \dot{X}_t \right) \right]^{-1} \nabla f(X_t) = 0 \qquad (1.41)$$

with $\delta = \epsilon^{\frac{1}{3}}$, $t^3 = \epsilon k^3$, where $h(x) = \frac{1}{3}\|x\|^3$ in (1.39). Furthermore, (1.41) has matching convergence
rate:

$$f(X_t) - f^* \le O\left(\frac{1}{t^3}\right). \qquad (1.42)$$


2 Higher-order gradient methods and rescaled gradient flows 

We study the family of higher-order gradient methods $\mathcal{G}_p \in \mathfrak{G}$ in discrete time, which corresponds
to first-order rescaled gradient flow in continuous time, with time step $\delta = \epsilon^{\frac{1}{p-1}}$, and matching
convergence rate $O(1/t^{p-1}) = O(1/\epsilon k^{p-1})$.


Surrogate optimization. Recall in discrete time, many optimization algorithms proceed by
minimizing a surrogate function:^4

$$x_{k+1} = \arg\min_{x \in \mathcal{X}} g(x; x_k). \qquad (2.1)$$

Here $g(x; x_k)$ is a surrogate function that majorizes the objective function $f(x)$, which means for
any reference point $x_k \in \mathcal{X}$ and for any point $x \in \mathcal{X}$, it satisfies the inequality:

$$g(x; x_k) \ge f(x) \qquad (2.2)$$

and equality holds at $x = x_k$. The property (2.2) above implies the algorithm (2.1) is a descent
method, which means the function value $f(x_k)$ decreases along the iteration of the algorithm:

$$f(x_{k+1}) \overset{(2.2)}{\le} g(x_{k+1}; x_k) \overset{(2.1)}{\le} g(x_k; x_k) = f(x_k). \qquad (2.3)$$

We can minimize $f$ by finding an appropriate surrogate function $g(x; x_k)$ that is more tractable
to minimize, and performing the descent algorithm (2.1). Moreover, under various assumptions,
we can quantify the decrease in the function value (2.3), resulting in a rate of convergence for the
algorithm (2.1).

^4 See [5] and the references therein for more information on the majorization-minimization principle in optimization.
Surrogate via Taylor expansion. A natural technique for constructing a surrogate function
$g(x; y)$ is to use the Taylor approximation of $f$ at $x$ from the reference point $y$:

$$f_{p-1}(x; y) = \sum_{i=0}^{p-1} \frac{1}{i!} \nabla^i f(y)(x - y)^i = f(y) + \langle \nabla f(y), x - y \rangle + \cdots + \frac{1}{(p-1)!} \nabla^{p-1} f(y)(x - y)^{p-1}. \qquad (2.4)$$

If $f$ is $L$-smooth of order $p-1$ (1.14), then we have a bound on the approximation error of $f_{p-1}$:

$$|f(x) - f_{p-1}(x; y)| \le \frac{L}{p!} \|x - y\|^p. \qquad (2.5)$$

Therefore, we have the family of (regularized) Taylor surrogate functions, for $p \ge 1$:

$$g_p(x; y) := f_{p-1}(x; y) + \frac{L}{p!} \|x - y\|^p. \qquad (2.6)$$

Note also the tangency property $g_p(x; x) = f_{p-1}(x; x) = f(x)$, so $g_p$ is indeed a surrogate function.

2.1 Higher-order gradient method 

The Taylor surrogate functions (2.6) give rise to higher-order gradient methods $\mathcal{G}_p \in \mathfrak{G}$, defined by
the update equation:

$$x_{k+1} = \arg\min_x \left\{ f_{p-1}(x; x_k) + \frac{1}{\epsilon} \cdot \frac{1}{p} \|x - x_k\|^p \right\}. \qquad (2.7)$$

The $p = 2$ case gives the gradient descent algorithm (1.1), and $p = 3$ is Nesterov and Polyak's [10]
cubic-regularized Newton's method (1.25). Note also that $p = 1$ gives a constant sequence, so we
only consider $p \ge 2$. In (2.7), we write the Lipschitz constant $L = \frac{(p-1)!}{\epsilon} = O(\frac{1}{\epsilon})$ in terms of the
step size $\epsilon > 0$, representing the discretization parameter of the algorithm.^5 Our discussion above
says that if $f$ is $\frac{(p-1)!}{\epsilon}$-smooth of order $p-1$, then the $p$-th gradient algorithm $\mathcal{G}_p$ (2.7) is a descent
method, since $f(x_{k+1}) \le f(x_k)$.

Moreover, we can use the convexity structure of $f$ to further ensure a quantitative decrease in
the residual $\delta_k = f(x_k) - f^* \ge 0$:

Lemma 1. If $f$ is convex and $\frac{(p-1)!}{\epsilon}$-smooth of order $p-1$, then the following holds for (2.7):

$$\delta_{k+1} \le \delta_k - \frac{\epsilon^{\frac{1}{p-1}}}{R^{\frac{p}{p-1}}} \cdot \delta_k^{\frac{p}{p-1}} \qquad (2.8)$$

where $R = \sup_{f(x) \le f(x_0)} \|x - x^*\|$ is the radius of the level set from the initial point $x_0$.

The proof of Lemma 1 is in Appendix B.1. Using the (discrete time) energy functional $\mathcal{E}_k =
\delta_k^{-\frac{1}{p-1}}$, we obtain the following convergence rate for $\mathcal{G}_p$, generalizing [9, Theorem 1]:

^5 This assumption vanishes in continuous time, since $L \to \infty$ as $\epsilon \to 0$.
Theorem 2. If $f$ is convex and $\frac{(p-1)!}{\epsilon}$-smooth of order $p-1$, then the following holds for (2.7):

$$f(x_k) - f^* \le \frac{(p-1)^{p-1} R^p}{\epsilon k^{p-1}}. \qquad (2.9)$$

Thus, the family of higher-order gradient methods $\mathfrak{G}$ in discrete time has a nice sequential
structure. In particular, there is a consistent pattern whereby as $p$ progresses in $\{2, 3, \dots\}$, the
polynomial convergence rate $O(1/\epsilon k^{p-1})$ decreases, but at the cost of an increasingly strict $(p-1)$-st
order smoothness assumption on $f$. This suggests there is a fundamental tradeoff in discrete time
between speed of convergence and strength of hypothesis required.


2.2 Rescaled gradient flow 


In continuous time, gradient flow (1.2) is a member ($p = 2$) of a family of rescaled gradient flows:

$$\dot{X}_t = -\frac{\nabla f(X_t)}{\|\nabla f(X_t)\|_*^{\frac{p-2}{p-1}}}. \qquad (2.10)$$

When $\nabla f(X_t) = 0$, we define the right hand side of (2.10) to be 0. Note that (2.10) implies that
the magnitude of the velocity at some time is proportional to (some power of) the gradient at that
point:

$$\|\dot{X}_t\| = \|\nabla f(X_t)\|_*^{\frac{1}{p-1}} \qquad (2.11)$$

so we can also equivalently write the rescaled gradient flow as

$$\|\dot{X}_t\|^{p-2} \dot{X}_t = -\nabla f(X_t). \qquad (2.12)$$

Furthermore, (2.12) is the optimality condition for the following optimization problem:

$$\dot{X}_t = \arg\min_v \left\{ f(X_t) + \langle \nabla f(X_t), v \rangle + \frac{1}{p} \|v\|^p \right\}. \qquad (2.13)$$

Thus, we can view the rescaled gradient flow (2.10) as a generalization of gradient flow that replaces
the squared norm in the optimization interpretation (1.4) by the $p$-th power of the norm, for $p \ge 2$.
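
The equivalence between the three forms of the rescaled gradient flow can be checked numerically. The sketch below assumes the Euclidean norm and that (2.10) reads $\dot{X}_t = -\nabla f(X_t) / \|\nabla f(X_t)\|^{\frac{p-2}{p-1}}$; it computes this velocity for a given gradient vector so that one can confirm $\|v\| = \|\nabla f\|^{\frac{1}{p-1}}$ and $\|v\|^{p-2} v = -\nabla f$. The function name and the sample gradient are illustrative assumptions.

```python
import math


def rescaled_gradient_direction(g, p):
    """Velocity of the rescaled gradient flow for gradient g and order p >= 2.

    Returns -g / ||g||^((p-2)/(p-1)) in the Euclidean norm, and 0 when g = 0.
    """
    norm = math.sqrt(sum(gi * gi for gi in g))
    if norm == 0.0:
        return [0.0 for _ in g]
    scale = norm ** ((p - 2) / (p - 1))
    return [-gi / scale for gi in g]


SAMPLE_GRADIENT = [3.0, 4.0]  # illustrative gradient with norm 5
velocities = {p: rescaled_gradient_direction(SAMPLE_GRADIENT, p) for p in (2, 3, 4)}
```

For $p = 2$ the scaling factor is $1$ and the velocity is the plain negative gradient, which recovers gradient flow (1.2).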

If $X_t$ is a curve that evolves following the rescaled gradient flow (2.10), then, using the convexity
of $f$, the energy functional:

$$\mathcal{E}_t = (f(X_t) - f^*)^{-\frac{1}{p-1}} \qquad (2.14)$$

increases linearly over time ($\dot{\mathcal{E}}_t \ge \frac{1}{(p-1) R^{\frac{p}{p-1}}}$; see Appendix B.2 for details), which implies the
following rate of convergence:

Theorem 3. If $f$ is convex with bounded level sets, then the following holds for (2.10):

$$f(X_t) - f^* \le \frac{(p-1)^{p-1} R^p}{t^{p-1}}. \qquad (2.15)$$

The result (2.15) above matches the discrete time convergence rate (2.9) of the $p$-th order
gradient method $\mathcal{G}_p$. Moreover, notice we use the same energy functional (2.14) as in discrete time.
2.3 The relation between $\epsilon$ and $\delta$

The matching convergence rates (2.9), (2.15) suggest the following time scaling relation between
continuous time $t \ge 0$ and discrete time $k \in \{2, 3, \dots\}$:

$$t = \epsilon^{\frac{1}{p-1}} k. \qquad (2.16)$$

So in one step of discrete time $k \mapsto k + 1$, the continuous time $t = t_k$ increments by:

$$\delta = \frac{dt}{dk} = \frac{t_{k+1} - t_k}{(k+1) - k} \overset{(2.16)}{=} \epsilon^{\frac{1}{p-1}}. \qquad (2.17)$$

The scaling above suggests the interpretation of the $p$-th gradient algorithm $\mathcal{G}_p$ as a discretization
of the rescaled gradient flow with time step $\delta = \epsilon^{\frac{1}{p-1}}$.

We can understand this scaling phenomenon more explicitly by starting from the continuous
time view. Suppose we have a continuous time curve $X_t$ (such as the rescaled gradient flow (2.10)),
and we want to discretize it with time step $\delta > 0$. This means we build a discrete time sequence
$x_k$ obtained by taking a snapshot of $X_t$ every $\delta$ increment of time, namely:

$$X_t = x_k, \qquad X_{t+\delta} = x_{k+1}. \qquad (2.18)$$

Notice, to go from $x_k$ to $x_{k+1}$ in (2.18) above, we have to first let $X_t$ evolve in continuous time to
$X_{t+\delta}$, which we then use as the value for $x_{k+1}$.

However, to discretize is to build a discrete time algorithm, which means we can only define
$x_{k+1}$ in terms of the previous discrete iterates $x_k, x_{k-1}, \dots$, without invoking the continuous time
curve $X_t$. Thus, we have to approximate the continuous time evolution from $X_t$ to $X_{t+\delta}$.

For the first-order rescaled gradient flow (2.10), we approximate $X_{t+\delta}$ using a linear approximation:

$$X_{t+\delta} \approx X_t + \delta \dot{X}_t \qquad (2.19)$$

with an error of order $o(\delta)$. Equivalently, we replace $\dot{X}_t$ in (2.13) by the discrete time difference:^6

$$\frac{x_{k+1} - x_k}{\delta} = \arg\min_v \left\{ \langle \nabla f(x_k), v \rangle + \frac{1}{p} \|v\|^p \right\}. \qquad (2.20)$$

Or, writing $v = \frac{x - x_k}{\delta}$, the above can be written as:

$$x_{k+1} = \arg\min_x \left\{ f(x_k) + \langle \nabla f(x_k), x - x_k \rangle + \frac{1}{\delta^{p-1}} \cdot \frac{1}{p} \|x - x_k\|^p \right\}. \qquad (2.21)$$

Thus, we obtain an algorithm similar to the $p$-th gradient method (2.7), where the step size
$\epsilon$ is given by $\epsilon = \delta^{p-1}$, consistent with the time scaling (2.17). We can interpret the $p$-th gradient
method (2.7) as a particular discretization technique of the rescaled gradient flow (2.10), which
replaces the first-order approximation of $f$ in the "naive" discretization (2.21) by the $(p-1)$-st
order approximation (2.7). By doing so, as well as assuming $(p-1)$-st order smoothness of $f$,
the resulting discrete time algorithm $\mathcal{G}_p$ has a $O(1/\epsilon k^{p-1})$ convergence rate which matches the
$O(1/t^{p-1})$ bound in continuous time.

^6 We can also start by replacing $\dot{X}_t$ in (2.10) by (2.19) and achieve the same conclusion. Namely, (2.10) becomes
$\frac{1}{\delta}(x_{k+1} - x_k) = -\nabla f(x_k) / \|\nabla f(x_k)\|_*^{\frac{p-2}{p-1}}$, or equivalently, $\frac{1}{\delta^{p-1}} \|x_{k+1} - x_k\|^{p-2} (x_{k+1} - x_k) = -\nabla f(x_k)$, which is (2.21).
3 Accelerated higher-order gradient methods

In Section 1.4, we see the simple pattern in Nesterov's constructions of accelerated methods [7, 8, 9]:

To accelerate an algorithm, couple it with a (suitably weighted) mirror descent step.

In this section, we extend Nesterov's technique to accelerate all higher-order gradient methods.
The accelerated $p$-th order gradient method is obtained by coupling $\mathcal{G}_p$ with a mirror descent
step weighted by a polynomial of order $p-1$. The accelerated algorithm has an improved
convergence rate $O(1/\epsilon k^p)$ under the same $(p-1)$-st order smoothness assumption as $\mathcal{G}_p$. Thus,
just like $\mathfrak{G}$, the family of accelerated gradient methods still maintains the nice sequential
property of the polynomially decreasing convergence rates.^7

Throughout this section, we fix an integer $p \ge 2$, and assume $f$ is $\frac{(p-1)!}{\epsilon}$-smooth of order
$p-1$ (1.14). For the mirror descent step, we assume the distance generating function $h$ is
$\sigma$-uniformly convex of order $p$ (1.15), where $\sigma > 0$ is a constant (we can normalize $\sigma = 1$). For
example, we can take $h$ to be the $p$-th power of the norm:

$$d_p(x) = \frac{1}{p} \|x - x_0\|^p \qquad (3.1)$$

(for arbitrary reference point $x_0 \in \mathcal{X}$), which is $(\frac{1}{2})^{p-2}$-uniformly convex of order $p$ [9, Lemma 4].


3.1 Accelerated p-th order gradient method 

The accelerated $p$-th order gradient method maintains three sequences $\{x_k\}, \{y_k\}, \{z_k\}$ as follows.
Starting from any $x_0 \in \mathcal{X}$, for $k \ge 0$ the algorithm proceeds:

$$y_k = \arg\min_y \left\{ f_{p-1}(y; x_k) + \frac{2}{\epsilon} \cdot \frac{1}{p} \|y - x_k\|^p \right\} \qquad (3.2a)$$

$$z_k = \arg\min_z \left\{ \sum_{i=0}^{k} C p \, i^{(p-1)} \left[ f(y_i) + \langle \nabla f(y_i), z - y_i \rangle \right] + \frac{1}{\epsilon} \cdot \frac{1}{\sigma} D_h(z, x_0) \right\} \qquad (3.2b)$$

$$x_{k+1} = \frac{p}{k+p} z_k + \frac{k}{k+p} y_k \qquad (3.2c)$$

where $i^{(p-1)} := i(i+1) \cdots (i+p-2)$ denotes the rising factorial, and $C \le (4p)^{-p}$ is a constant.

Note, the y-update (3.2a) above is the $p$-th gradient method, but with slightly larger regularization
($\frac{c}{\epsilon}$ with any $c > 1$; above is $c = 2$). As noted in Section 1.4, the z-update (3.2b) above is
given in a "dual averaging" form, because (for example, when $\mathcal{X} = \mathbb{R}^d$) we can write it explicitly:

$$\nabla h(z_k) = \nabla h(x_0) - \epsilon \sigma C p \sum_{i=0}^{k} i^{(p-1)} \nabla f(y_i) = \nabla h(z_{k-1}) - \epsilon \sigma C p \, k^{(p-1)} \nabla f(y_k). \qquad (3.3)$$

^7 While preparing this paper, we became aware of an unpublished manuscript by Baes [3] who extended Nesterov's
technique of estimate sequence and constructed higher-order variants of the accelerated methods, essentially identical
to ours. We nevertheless present our generalization of Nesterov's proof in order to highlight the basic structure.

Therefore, the z-update (3.2b) can also be equivalently written recursively as a mirror descent step:

$$z_k = \arg\min_z \left\{ C p \, k^{(p-1)} \langle \nabla f(y_k), z \rangle + \frac{1}{\epsilon} \cdot \frac{1}{\sigma} D_h(z, z_{k-1}) \right\}. \qquad (3.4)$$

But it turns out that the expanded form of the update (3.2b) is more convenient for us.

3.2 Convergence analysis 

We can justify the performance of the accelerated algorithm (3.2) by following a straightforward 
generalization of Nesterov’s arguments [8, 9], which proceeds as follows. We first recall the follow¬ 
ing property for the p-th order gradient step in the y-update (3.2a). This lemma generalizes [9, 
Lemma 6], and its proof is provided in Appendix B.3. 

Lemma 4. If $f$ is $\frac{(p-1)!}{\epsilon}$-smooth of order $p-1$, then the y-update (3.2a) has the guarantee:

(^fiVk), Xk-Vk) > ^e^l|V/(?/fc)||r^ (3.5) 

With Lemma 4 in hand, we can proceed as follows. Let $\psi_k$ (Nesterov's "estimate function")
denote the objective function in the z-update (3.2b):

$$\psi_k(x) = C p \sum_{i=0}^{k} i^{(p-1)} \left[ f(y_i) + \langle \nabla f(y_i), x - y_i \rangle \right] + \frac{1}{\epsilon \sigma} D_h(x, x_0). \qquad (3.6)$$

Since $f$ is convex, each term in the summation above is at most $C p \, i^{(p-1)} f(x)$. Noting that
$\sum_{i=0}^{k} i^{(p-1)} = k^{(p)}/p$, this yields the following upper bound on the estimate function, for all $x \in \mathcal{X}$:

$$\psi_k(x) \le C k^{(p)} f(x) + \frac{1}{\epsilon \sigma} D_h(x, x_0). \qquad (3.7)$$
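
The summation identity used to pass from (3.6) to (3.7) is easy to verify with exact integer arithmetic. The sketch below assumes the rising factorial is $i^{(m)} = i(i+1)\cdots(i+m-1)$ and checks $\sum_{i=0}^{k} i^{(p-1)} = k^{(p)}/p$, together with the related ratio $p (k+1)^{(p-1)} / (k+1)^{(p)} = \frac{p}{k+p}$, which is the weight on $z_k$ in the coupling step (3.2c). The helper names are hypothetical.

```python
from fractions import Fraction


def rising_factorial(i, m):
    """i^(m) = i (i+1) ... (i+m-1), the rising factorial with m factors."""
    result = 1
    for j in range(m):
        result *= i + j
    return result


def partial_sum(k, p):
    """Sum of i^(p-1) for i = 0, ..., k."""
    return sum(rising_factorial(i, p - 1) for i in range(k + 1))


def coupling_weight(k, p):
    """The ratio p (k+1)^(p-1) / (k+1)^(p), as an exact fraction."""
    return Fraction(p * rising_factorial(k + 1, p - 1), rising_factorial(k + 1, p))
```

For instance, with $p = 3$ and $k = 2$ the sum is $0 + 2 + 6 = 8$, and $k^{(3)}/3 = (2 \cdot 3 \cdot 4)/3 = 8$.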

Furthermore, the updates in (3.2) are constructed in such a way that we also have the following 
guarantee. 

Proposition 5. If $C \le (4p)^{-p}$, then for all $k \ge 0$:

$$C k^{(p)} f(y_k) \le \psi_k^* := \min_x \psi_k(x). \qquad (3.8)$$

Proof. We proceed via induction on $k \ge 0$. The base case $k = 0$ is trivial since both sides equal 0.
Now assume (3.8) holds for some $k \ge 0$; we will show it also holds for $k + 1$.

Since $h$ is $\sigma$-uniformly convex of order $p$, the rescaled Bregman divergence $\frac{1}{\epsilon \sigma} D_h(x, x_0)$ is
$\frac{1}{\epsilon}$-uniformly convex. Thus, the estimate function $\psi_k$ (3.6) is also $\frac{1}{\epsilon}$-uniformly convex of order $p$. Since
$z_k$ is the minimizer of $\psi_k$, this implies $\psi_k(x) \ge \psi_k^* + \frac{1}{\epsilon p} \|x - z_k\|^p$. We then apply the inductive
hypothesis (3.8) and use the convexity of $f$ (1.12) to obtain the bound, for all $x \in \mathcal{X}$:

$$\psi_k(x) \ge C k^{(p)} \left[ f(y_{k+1}) + \langle \nabla f(y_{k+1}), y_k - y_{k+1} \rangle \right] + \frac{1}{\epsilon p} \|x - z_k\|^p. \qquad (3.9)$$

Then by adding the $(k+1)$-st term in the definition of $\psi_{k+1}$ (3.6) to both sides of (3.9), we obtain:

$$\psi_{k+1}(x) \ge C (k+1)^{(p)} \left[ f(y_{k+1}) + \langle \nabla f(y_{k+1}), x_{k+1} - y_{k+1} + \tau_k (x - z_k) \rangle \right] + \frac{1}{\epsilon p} \|x - z_k\|^p \qquad (3.10)$$

where $\tau_k = \frac{p (k+1)^{(p-1)}}{(k+1)^{(p)}} = \frac{p}{k+p}$, and in the above we have also used the definition of $x_{k+1}$ as a convex
combination of $y_k$ and $z_k$ with weight $\tau_k$ (3.2c).

Note that the first term in (3.10) above gives our desired inequality (3.8) for $k + 1$. So to finish the proof, we have to prove the remaining terms in (3.10) are nonnegative. We do so by applying two inequalities. We apply Lemma 4 to the term $\langle \nabla f(y_{k+1}), x_{k+1} - y_{k+1} \rangle$. We also apply the classical Fenchel-Young inequality (e.g., [9, Lemma 2]): $\langle s, h \rangle + \frac{1}{p} \|h\|^p \ge -\frac{p-1}{p} \|s\|_*^{\frac{p}{p-1}}$, with the choices $h = \epsilon^{-\frac{1}{p}} (x - z_k)$ and $s = \epsilon^{\frac{1}{p}} Cp\, (k+1)^{(p-1)} \nabla f(y_{k+1})$. Then from (3.10), we obtain:

$$\psi_{k+1}(x) \ge C (k+1)^{(p)} \left[ f(y_{k+1}) + \left( \frac{1}{4} - \frac{p-1}{p}\, p^{\frac{p}{p-1}} C^{\frac{1}{p-1}} \frac{\{(k+1)^{(p-1)}\}^{\frac{p}{p-1}}}{(k+1)^{(p)}} \right) \epsilon^{\frac{1}{p-1}} \|\nabla f(y_{k+1})\|_*^{\frac{p}{p-1}} \right].$$

Notice that $\{(k+1)^{(p-1)}\}^{\frac{p}{p-1}} \le (k+1)^{(p)}$. Then from the assumption $C \le (4p)^{-p}$, we see that the second term inside the parentheses above is nonnegative. Hence we conclude the desired inequality $\psi_{k+1}(x) \ge C (k+1)^{(p)} f(y_{k+1})$, finishing the induction. □


Finally, we can combine the result of Proposition 5 with the basic estimate (3.7) at $x = x^*$ to conclude a convergence rate for $Q_p$, which we summarize in the following theorem.

Theorem 6. If $f$ is $\frac{(p-1)!}{\epsilon}$-smooth of order $p-1$, $h$ is $\sigma$-uniformly convex of order $p$, and $C \le (4p)^{-p}$, then the p-th order accelerated gradient algorithm (3.2) has convergence rate:

$$f(y_k) - f^* \le \frac{D_h(x^*, x_0)}{C \epsilon \sigma k^{(p)}}. \tag{3.11}$$

The result above shows that by plugging the p-th gradient method $G_p$ into the accelerated algorithm (3.2), we boost its convergence rate from $O(1/\epsilon k^{p-1})$ to $O(1/\epsilon k^p)$. In particular, the family of accelerated gradient algorithms $Q_p$ still maintains the nice sequential pattern of polynomial convergence rates, just like the family of gradient methods $G_p$.
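As a numerical sanity check of the bound (3.11), the sketch below runs the case p = 2 with Euclidean h(x) = ½‖x‖² (so σ = 1) on a small quadratic. The explicit updates (3.2a)-(3.2c) are not restated in this section, so the forms used here are assumptions inferred from the recursion for the z-update and from the convex combination with weight p/(k+p): a gradient step y_k = x_k − (ε/2)∇f(x_k), the recursion z_k = z_{k−1} − εCp·k·∇f(y_k) with z_0 = x_0, and x_{k+1} = τ_k z_k + (1 − τ_k) y_k. The function name and the test problem are illustrative only.

```python
def accelerated_gradient_p2(grad_f, x0, eps, num_iters, C=1.0 / 64):
    """Sketch of the accelerated scheme for p = 2, h(x) = 0.5*||x||^2.

    Assumed updates (inferred for this sketch, not quoted from the paper):
      y_k     = x_k - (eps / 2) * grad_f(x_k)
      z_k     = z_{k-1} - eps * C * p * k * grad_f(y_k)    # k^{(1)} = k
      x_{k+1} = tau_k * z_k + (1 - tau_k) * y_k,  tau_k = p / (k + p)
    Returns the list [y_0, ..., y_num_iters].
    """
    p = 2
    x = list(x0)
    z = list(x0)  # z_0 = x_0 because the k = 0 weight vanishes
    ys = []
    for k in range(num_iters + 1):
        gx = grad_f(x)
        y = [xi - 0.5 * eps * gi for xi, gi in zip(x, gx)]
        gy = grad_f(y)
        z = [zi - eps * C * p * k * gi for zi, gi in zip(z, gy)]
        tau = p / (k + p)
        x = [tau * zi + (1.0 - tau) * yi for zi, yi in zip(z, y)]
        ys.append(y)
    return ys


# Toy problem: f(x) = 0.5 * (x_1^2 + 0.01 * x_2^2). Its gradient is 1-Lipschitz,
# so eps = 1 makes f (1/eps)-smooth; the minimizer is x* = 0 with f* = 0.
CURVATURES = [1.0, 0.01]


def f(x):
    return 0.5 * sum(a * xi * xi for a, xi in zip(CURVATURES, x))


def grad_f(x):
    return [a * xi for a, xi in zip(CURVATURES, x)]


EPS, C_CONST, X0 = 1.0, 1.0 / 64, [1.0, 1.0]
ys = accelerated_gradient_p2(grad_f, X0, EPS, num_iters=200, C=C_CONST)


def bound(k):
    """Right-hand side of (3.11) for p = 2: D_h(x*, x0) / (C * eps * k * (k + 1))."""
    d_h = 0.5 * sum(xi * xi for xi in X0)
    return d_h / (C_CONST * EPS * k * (k + 1))


gaps = [f(y) for y in ys]  # f(y_k) - f*, since f* = 0
print(gaps[10], bound(10))
print(gaps[200], bound(200))
```

With the conservative constant C = (4p)^{-p} = 1/64 the bound is loose on this problem; the point of the sketch is only that the observed gap stays below it at every iteration.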


3.3 Continuous time limit 

We show that the continuous time limit ($\epsilon \to 0$) of the accelerated algorithm $Q_p$ (3.2) is a second order differential equation (with time scaling $\delta = \epsilon^{\frac{1}{p}}$). This is in contrast to the original p-th order gradient method $G_p$ (2.7), which corresponds to the first-order rescaled gradient flow in continuous time (with time scaling $\delta = \epsilon^{\frac{1}{p-1}}$). Here we sketch the transition from discrete to continuous time, and in Section 4.2 we will see the other view starting from the continuous time perspective.

We first note the following difference between the continuous time considerations of $G_p$ and $Q_p$. The p-th gradient method $G_p$ is an algorithm that updates the sequence $x_k$ by:

$$x_{k+1} = A_\epsilon(x_k) \tag{3.12}$$

where $A_\epsilon : \mathcal{X} \to \mathcal{X}$ is the operator that returns the minimizer of the optimization problem (2.7). The update (3.12) above is equivalent to modeling the (discrete time) velocity $v_k = x_{k+1} - x_k$ as a function of the current position, $v_k = A'_\epsilon(x_k)$ ($= A_\epsilon(x_k) - x_k$). As $\epsilon \to 0$, and by identifying $x_{k+1} - x_k = \delta \dot{X}_t$ with $\delta = \epsilon^{\frac{1}{p-1}}$, we recover a first order differential equation, the rescaled gradient flow (2.10).

On the other hand, the accelerated gradient algorithm $Q_p$ maintains three sequences:

$$(x_{k+1}, y_{k+1}, z_{k+1}) = A_{k,\epsilon}(x_k, y_k, z_k) \tag{3.13}$$

where the operator $A_{k,\epsilon}$ now changes over (discrete) time $k$. As in the preceding paragraph, as $\epsilon \to 0$ the update (3.13) above gives rise to a system of first order differential equations in the variables $X_t, Z_t$ (and $Y_t = X_t$), which is equivalent to a second order equation in $X_t$. Moreover, since $A_{k,\epsilon}$ depends on $k$, this second order equation has time-varying coefficients.

Derivation. We now analyze the updates (3.2) in the limit $\epsilon \to 0$, starting with the z-update (3.2b).

• z-update. As noted in Section 3.1, we can write the z-update recursively as:

$$w_k = w_{k-1} - \epsilon Cp\, k^{(p-1)} \nabla f(y_k) \tag{3.14}$$

where $w_k = \nabla h(z_k)$, and here we set $\sigma$ (the uniform convexity constant of $h$) to be 1 for simplicity. Now we invoke the hypothesis that the sequences $y_k, w_k$ are discrete time snapshots of continuous time curves $Y_t, W_t$ (similarly for $x_k, z_k$ with respect to $X_t, Z_t$) at each time increment $\delta > 0$. Specifically, we identify $y_k = Y_t$, $w_k = W_t$, and $w_{k-1} = W_{t-\delta}$, and use the linear approximation $W_{t-\delta} = W_t - \delta \dot{W}_t + o(\delta)$. Under these identifications, (3.14) becomes:

$$\dot{W}_t = -\frac{\epsilon}{\delta} Cp\, k^{(p-1)} \nabla f(Y_t) + o(1). \tag{3.15}$$

With time increment $\delta$ we have the correspondence $t = \delta k$ or $k = t/\delta$, so $k^{(p-1)} \approx k^{p-1} = t^{p-1}/\delta^{p-1}$. Thus, on the right hand side of (3.15) we have the factor $\epsilon/\delta^p$. As $\epsilon \to 0$ and $\delta \to 0$, for the expression (3.15) above to have a meaningful limit, the two variables need to scale as $\epsilon = \delta^p$ (or $\epsilon = \Theta(\delta^p)$ in general). Under this scaling, (3.15) yields the differential equation for $W_t$:

$$\dot{W}_t = -Cp\, t^{p-1} \nabla f(Y_t). \tag{3.16}$$

Since $w_k = \nabla h(z_k)$, $W_t = \nabla h(Z_t)$, we also have $\frac{d}{dt} \nabla h(Z_t) = -Cp\, t^{p-1} \nabla f(Y_t)$, or equivalently:

$$\dot{Z}_t = -Cp\, t^{p-1} \left[ \nabla^2 h(Z_t) \right]^{-1} \nabla f(Y_t). \tag{3.17}$$

Notice that (3.17) is a first order equation in time since it involves the velocity $\dot{Z}_t$, but second order in space since it involves the Hessian $\nabla^2 h(Z_t)$.

• y-update. Observe that the y-update (3.2a) operates on a smaller time scale $\epsilon^{\frac{1}{p-1}}$. That is, from the first order optimality condition of $y_k$ (B.8), and the bound implied by the $\frac{(p-1)!}{\epsilon}$-Lipschitz assumption on $\nabla^{p-1} f$ (B.9), we have:⁸

$$\|x_k - y_k\| \le \epsilon^{\frac{1}{p-1}} \|\nabla f(y_k)\|_*^{\frac{1}{p-1}}. \tag{3.18}$$

With the identification $x_k = X_t$, $y_k = Y_t$, the bound (3.18) above shows the difference $X_t - Y_t = O(\epsilon^{\frac{1}{p-1}})$ is smaller than our time step $\delta = \epsilon^{\frac{1}{p}}$ (needed for (3.17)). Therefore:

$$X_t = Y_t. \tag{3.19}$$


• x-update. The x-update (3.2c) can be written as $x_{k+1} - x_k = \frac{p}{k+p} (z_k - x_k) + \frac{k}{k+p} (y_k - x_k)$. Identifying $x_{k+1} - x_k = \delta \dot{X}_t$, $z_k = Z_t$, and $\delta (k + p) = t$, as well as using the bound (3.18) with $\delta = \epsilon^{\frac{1}{p}}$, we get $\|\dot{X}_t - \frac{p}{t} (Z_t - X_t)\| \le \frac{1}{\delta} \epsilon^{\frac{1}{p-1}} \|\nabla f(Y_t)\|_*^{\frac{1}{p-1}} \to 0$, which means:

$$\dot{X}_t = \frac{p}{t} (Z_t - X_t). \tag{3.20}$$

Thus, we conclude that the continuous time limit of the accelerated gradient method $Q_p$ (3.2) is the system of first order differential equations (3.17), (3.19), (3.20). This system is equivalent to the following second order differential equation for $X_t$:

$$\ddot{X}_t + \frac{p+1}{t} \dot{X}_t + Cp^2 t^{p-2} \left[ \nabla^2 h\left( X_t + \frac{t}{p} \dot{X}_t \right) \right]^{-1} \nabla f(X_t) = 0. \tag{3.21}$$

Note that even though the accelerated gradient methods $Q_p$ use higher-order derivatives of $f$, in continuous time they are all second order differential equations (3.21). This parallels, but also contrasts with, how the p-th gradient method $G_p$ is a $(p-1)$-st order algorithm (in space) but corresponds to a first order differential equation (in time), the rescaled gradient flow (2.10).
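For readers who want the elimination of $Z_t$ spelled out, here is a short sketch of the step from the first order system to the second order equation (3.21), using $Y_t = X_t$ from (3.19):

$$\begin{aligned}
Z_t &= X_t + \frac{t}{p} \dot{X}_t && \text{(rearranging (3.20))} \\
\dot{Z}_t &= \frac{p+1}{p} \dot{X}_t + \frac{t}{p} \ddot{X}_t && \text{(differentiating in } t\text{)} \\
\frac{p+1}{p} \dot{X}_t + \frac{t}{p} \ddot{X}_t &= -Cp\, t^{p-1} \left[ \nabla^2 h(Z_t) \right]^{-1} \nabla f(X_t) && \text{(substituting into (3.17))}
\end{aligned}$$

Multiplying the last line by $p/t$ and moving everything to the left hand side gives (3.21), with the Hessian evaluated at $Z_t = X_t + \frac{t}{p} \dot{X}_t$.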

4 Nesterov flow 

In this section we study the family of second order differential equations (3.21), the Nesterov flow:

$$\ddot{X}_t + \frac{p+1}{t} \dot{X}_t + Cp^2 t^{p-2} \left[ \nabla^2 h\left( X_t + \frac{t}{p} \dot{X}_t \right) \right]^{-1} \nabla f(X_t) = 0 \tag{4.1}$$

where $p > 0$ is not necessarily an integer, and $C > 0$ is a constant. Here we assume $f$ is convex and continuously differentiable, and $h$ is strictly convex so $\nabla^2 h$ is invertible. But we make no smoothness assumption on $f$ nor uniform convexity assumption on $h$.

⁸Lemma 4 gives the reverse $\|x_k - y_k\| \ge \frac{1}{4} \epsilon^{\frac{1}{p-1}} \|\nabla f(y_k)\|_*^{\frac{1}{p-1}}$, so the bound is tight. Indeed, like in Section 2.3, the y-update (3.2a) is a discretized rescaled gradient flow: $y_k = x_k - 2^{-\frac{1}{p-1}} \epsilon^{\frac{1}{p-1}} \nabla f(x_k) / \|\nabla f(x_k)\|_*^{\frac{p-2}{p-1}} + o(\epsilon^{\frac{1}{p-1}})$.

Notice that we can equivalently write Nesterov flow (4.1) as follows:

$$\frac{d}{dt} \nabla h\left( X_t + \frac{t}{p} \dot{X}_t \right) = \nabla^2 h\left( X_t + \frac{t}{p} \dot{X}_t \right) \left( \frac{p+1}{p} \dot{X}_t + \frac{t}{p} \ddot{X}_t \right) = -Cp\, t^{p-1} \nabla f(X_t). \tag{4.2}$$

As shown in Section 3.3, equation (4.2) is the continuous time limit of the p-th accelerated gradient method $Q_p$ (3.2), for integer $p \ge 2$. The rewriting (4.2) is nicer than (4.1) because it avoids the potential singularity problem at $t = 0$ (i.e., no term $\frac{p+1}{t}$). Indeed, by setting $t \to 0$ in (4.2) we see that if $p > 1$, then $\dot{X}_0 = 0$. That is, any Nesterov curve $X_t$ that evolves following (4.1), (4.2) must start from rest; it is the acceleration $\ddot{X}_t$ that drives the trajectory.


4.1 Convergence rate via energy functional 

Our interest in Nesterov flow (4.1) stems from it being the continuous time limit of the p-th accelerated gradient method $Q_p$, which has convergence rate $O(1/\epsilon k^p)$ (Theorem 6). We now show Nesterov flow preserves this convergence rate, without any additional assumptions on $f, h$ beyond convexity. Define the energy functional:

$$\mathcal{E}_t = C t^p (f(X_t) - f^*) + D_h\left( x^*, X_t + \frac{t}{p} \dot{X}_t \right) \tag{4.3}$$

where recall, $x^* = \arg\min_x f(x)$ and $f^* = f(x^*)$. It has time derivative:

$$\dot{\mathcal{E}}_t = Cp\, t^{p-1} \left( f(X_t) - f^* + \frac{t}{p} \langle \nabla f(X_t), \dot{X}_t \rangle \right) - \left\langle \frac{d}{dt} \nabla h\left( X_t + \frac{t}{p} \dot{X}_t \right), x^* - X_t - \frac{t}{p} \dot{X}_t \right\rangle. \tag{4.4}$$

If $X_t$ is governed by Nesterov flow (4.1), then $\dot{\mathcal{E}}_t$ above simplifies to:

$$\dot{\mathcal{E}}_t = Cp\, t^{p-1} \left( f(X_t) - f^* + \langle \nabla f(X_t), x^* - X_t \rangle \right) \overset{(1.12)}{\le} 0 \tag{4.5}$$

where the last inequality follows from the convexity of $f$. This means energy is nonincreasing over time: $\mathcal{E}_t \le \mathcal{E}_0 = D_h(x^*, X_0)$, for all $t \ge 0$. Since $D_h(x^*, X_t + \frac{t}{p} \dot{X}_t) \ge 0$, we conclude that a curve $X_t$ governed by Nesterov flow (4.1) has convergence guarantee:

$$f(X_t) - f^* \le \frac{D_h(x^*, X_0)}{C t^p} \tag{4.6}$$

which matches the $O(1/\epsilon k^p)$ convergence rate of $Q_p$ (3.11) in discrete time, as claimed. But the bound (4.6) holds for all $p > 0$, and only requires convexity of $f$ (in (4.5)) and $h$ (so that $D_h \ge 0$).
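The energy argument can be checked numerically. The sketch below integrates the flow in its first order form (the system $\dot{X}_t = \frac{p}{t}(Z_t - X_t)$, $\frac{d}{dt} \nabla h(Z_t) = -Cp\,t^{p-1} \nabla f(X_t)$) for a one-dimensional quadratic with h(x) = x²/2, and records the energy (4.3). Everything specific here is an assumption made for illustration: the classical Runge-Kutta integrator, the step size, the test function, and starting at a small positive time t0 with Z = X = x0 to sidestep the 1/t singularity, in which case the guarantee reads f(X_t) − f* ≤ E_{t0}/(C t^p).

```python
def simulate_nesterov_flow(grad_f, f, x0, x_star, p=2.0, C=0.25,
                           t0=0.1, t_end=10.0, dt=1e-3):
    """Integrate the first order form of Nesterov flow in 1-D with h(x) = x^2/2.

    Returns a list of (t, x, energy) tuples, where energy is (4.3) with
    D_h(x*, z) = 0.5 * (x* - z)^2.
    """
    f_star = f(x_star)

    def rhs(t, x, z):
        # dX/dt = (p/t)(Z - X);  dZ/dt = -C p t^{p-1} f'(X) since grad h is the identity
        return (p / t) * (z - x), -C * p * t ** (p - 1) * grad_f(x)

    def energy(t, x, z):
        return C * t ** p * (f(x) - f_star) + 0.5 * (x_star - z) ** 2

    t, x, z = t0, x0, x0
    history = [(t, x, energy(t, x, z))]
    n_steps = int(round((t_end - t0) / dt))
    for _ in range(n_steps):
        # one classical fourth order Runge-Kutta step
        k1x, k1z = rhs(t, x, z)
        k2x, k2z = rhs(t + dt / 2, x + dt / 2 * k1x, z + dt / 2 * k1z)
        k3x, k3z = rhs(t + dt / 2, x + dt / 2 * k2x, z + dt / 2 * k2z)
        k4x, k4z = rhs(t + dt, x + dt * k3x, z + dt * k3z)
        x += dt / 6 * (k1x + 2 * k2x + 2 * k3x + k4x)
        z += dt / 6 * (k1z + 2 * k2z + 2 * k3z + k4z)
        t += dt
        history.append((t, x, energy(t, x, z)))
    return history


# f(x) = x^2 / 2 with minimizer x* = 0
history = simulate_nesterov_flow(lambda x: x, lambda x: 0.5 * x * x,
                                 x0=1.0, x_star=0.0)
initial_energy = history[0][2]
t_final, x_final, e_final = history[-1]
print(initial_energy, e_final, 0.5 * x_final ** 2)
```

In this run the recorded energy should be nonincreasing up to integration error, and the function gap should stay below the initial energy divided by C t^p.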

We note that (4.3) is a generalization of the energy functional in [13], who were the first to point out that accelerated gradient descent ($p = 2$, Euclidean case) in continuous time corresponds to the second order equation $\ddot{X}_t + \frac{3}{t} \dot{X}_t + \nabla f(X_t) = 0$, and proved a matching $O(1/t^2)$ convergence rate [13, Theorem 3.2]. They also remarked on the significance of 3 as being the smallest value of $r$ such that the modified equation $\ddot{X}_t + \frac{r}{t} \dot{X}_t + \nabla f(X_t) = 0$ has the same inverse quadratic $O((r-1)^2/t^2)$ convergence rate, and this guarantee breaks for $r < 3$, so there is a “phase transition” at $r = 3$.

We can explain the “phase transition” as follows, setting $r = p + 1$. If we use $\frac{p+1}{t} \dot{X}_t$ as the velocity term in (4.1), then we should increase the weight of $\nabla f(X_t)$ to $Cp^2 t^{p-2}$ in order to get the optimal convergence rate $O(1/t^p)$ (4.6). Indeed, we can generalize the energy functional (4.3) to $\mathcal{E}'_t = \rho_t (f(X_t) - f^*) + D_h(x^*, X_t + \frac{t}{p} \dot{X}_t)$ for any increasing $\rho_t > 0$ with $\rho_0 = 0$. By the same calculation (4.4), if $\ddot{X}_t + \frac{p+1}{t} \dot{X}_t + \frac{p^2}{t^2} \rho_t \left[ \nabla^2 h(X_t + \frac{t}{p} \dot{X}_t) \right]^{-1} \nabla f(X_t) = 0$, then $\dot{\mathcal{E}}'_t \le 0$ as long as $\dot{\rho}_t \le \frac{p}{t} \rho_t$, yielding convergence rate $O(1/\rho_t)$. In particular, the equation $\ddot{X}_t + \frac{r}{t} \dot{X}_t + \nabla f(X_t) = 0$ from [13] is the case $\rho_t = t^2/(r-1)^2$ with convergence rate $O(1/\rho_t) = O((r-1)^2/t^2)$, consistent with their result. So indeed 3 is special, as $3 = p + 1$ when $p = 2$. Using $r = p + 1 > 3$ requires weighting $\nabla f(X_t)$ by $Cp^2 t^{p-2}$, which is necessary for the $O(1/t^p)$ rate in continuous time.
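The calculation behind the condition on $\rho_t$ can be made explicit. The following is a sketch, writing $Z_t = X_t + \frac{t}{p} \dot{X}_t$ and assuming the flow is written in the form $\frac{d}{dt} \nabla h(Z_t) = -\frac{p}{t} \rho_t \nabla f(X_t)$, which reduces to (4.2) when $\rho_t = C t^p$:

$$\begin{aligned}
\frac{d}{dt} \Big[ \rho_t (f(X_t) - f^*) + D_h(x^*, Z_t) \Big]
&= \dot{\rho}_t (f(X_t) - f^*) + \rho_t \langle \nabla f(X_t), \dot{X}_t \rangle - \Big\langle \frac{d}{dt} \nabla h(Z_t), x^* - Z_t \Big\rangle \\
&= \dot{\rho}_t (f(X_t) - f^*) + \frac{p}{t} \rho_t \langle \nabla f(X_t), x^* - X_t \rangle \\
&\le \Big( \dot{\rho}_t - \frac{p}{t} \rho_t \Big) (f(X_t) - f^*).
\end{aligned}$$

The last line uses the convexity of $f$, and it is nonpositive when $\dot{\rho}_t \le \frac{p}{t} \rho_t$. For $\rho_t = t^2/p^2$ (the equation of [13] with $r = p + 1$) this condition reads $2/p \le 1$, that is, $r \ge 3$.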


4.2 Discretizing Nesterov flow 


In this section we examine how to discretize Nesterov flow (4.1) so as to preserve the convergence guarantee. Following the approach in Section 2.3, we choose to discretize the equivalent equation (4.2), which can be written as a system of two first order equations:

$$Z_t = X_t + \frac{t}{p} \dot{X}_t \tag{4.7a}$$

$$\frac{d}{dt} \nabla h(Z_t) = -Cp\, t^{p-1} \nabla f(X_t). \tag{4.7b}$$

Now suppose we discretize $X_t, Z_t$ into sequences $x_k, z_k$ with time step $\delta > 0$, that is, if $x_k = X_t$ then $x_{k+1} = X_{t+\delta} = X_t + \delta \dot{X}_t$, and similarly for $z_k = Z_t$, $z_{k+1} = Z_{t+\delta} = Z_t + \delta \dot{Z}_t$. This means $k$ discrete iterations, each corresponding to a jump of length $\delta$, are equivalent to the elapse of $t = \delta k$ continuous time.

Under this identification, (4.7a) becomes $z_k = x_k + \frac{k}{p} (x_{k+1} - x_k)$, or equivalently:

$$x_{k+1} = \frac{p}{k} z_k + \frac{k-p}{k} x_k \tag{4.8}$$

which is the same as the x-update in $Q_p$ (3.2c), except here we use $x_k$ instead of $y_k$ (which is currently not in the algorithm). Moreover, (4.8) uses convex weight $\frac{p}{k}$, but it is equivalent to the weight $\frac{p}{k+p} = \frac{p}{k} + o(\frac{1}{k})$ in (3.2c). Note that in (4.8) there is no $\delta$.

Similarly, (4.7b) becomes $\frac{1}{\delta} (\nabla h(z_{k+1}) - \nabla h(z_k)) = -Cp\, (\delta k)^{p-1} \nabla f(x_k)$, which we can recognize as the optimality condition of a mirror descent step:

$$z_{k+1} = \arg\min_z \left\{ Cp\, k^{p-1} \langle \nabla f(x_k), z \rangle + \frac{1}{\epsilon} D_h(z, z_k) \right\} \tag{4.9}$$

which is the same as the z-update in $Q_p$ (3.4), except here we use $\nabla f(x_k)$ instead of $\nabla f(y_k)$; moreover, (4.9) uses $k^{p-1}$ instead of the equivalent weighting $(k+1)^{(p-1)} = \Theta(k^{p-1})$ in (3.4). We also see the scaling $\epsilon = \delta^p$ between the step size $\epsilon$ in (4.9) and the time step $\delta$ in the discretization.

In principle, the two updates (4.8), (4.9) define an algorithm that “implements” Nesterov flow (4.7) in discrete time. However, we also want a matching convergence rate guarantee $O(1/\epsilon k^p)$ (since $\epsilon k^p = (\delta k)^p = t^p$), and unfortunately that does not seem possible with only (4.8), (4.9). We can try to follow the approach in the proof of Theorem 6, and attempt to establish upper and lower estimates of the objective $f$ by the estimate function $\psi_k$. However, a key step in the proof is showing that the remainder of the expression (3.10) is nonnegative, for which we need the result (3.5) in Lemma 4 as well as the sequence $y_k$. Thus, we can view the accelerated gradient algorithm $Q_p$ (3.2) as a discretization of Nesterov flow (4.7), with the introduction of an additional sequence $y_k$ (3.2a) whose purpose is to guarantee inequality (3.5), which implies the matching convergence rate $O(1/\epsilon k^p)$.
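To make the two-sequence discretization concrete, here is a minimal sketch of one step of the forward discretization in the Euclidean case h(x) = ½‖x‖², an assumption made only for this example. In that case the mirror descent step has the closed form z_{k+1} = z_k − ε C p k^{p−1} ∇f(x_k). The function name and default parameter values are hypothetical, and, as the text explains, no convergence rate is claimed for this scheme on its own.

```python
def naive_nesterov_step(x, z, k, grad_f, p=2, C=0.25, eps=0.1):
    """One step of the forward discretization of Nesterov flow with h(x) = 0.5*||x||^2.

    x-update: x_{k+1} = (p/k) z_k + ((k - p)/k) x_k
    z-update: z_{k+1} = z_k - eps * C * p * k^{p-1} * grad_f(x_k)
    Requires k >= p so that p/k is a valid convex weight.
    """
    if k < p:
        raise ValueError("need k >= p for a convex combination")
    g = grad_f(x)
    x_next = [(p / k) * zi + ((k - p) / k) * xi for xi, zi in zip(x, z)]
    z_next = [zi - eps * C * p * k ** (p - 1) * gi for zi, gi in zip(z, g)]
    return x_next, z_next


def identity_gradient(x):
    """Gradient of f(x) = 0.5 * ||x||^2."""
    return list(x)


# At k = p the x-update jumps to z_k, and the z-update takes a gradient step.
x1, z1 = naive_nesterov_step([1.0], [1.0], k=2, grad_f=identity_gradient)
# For larger k the x-update mixes z_k and x_k with weights p/k and (k - p)/k.
x2, z2 = naive_nesterov_step([0.0], [2.0], k=4, grad_f=identity_gradient)
print(x1, z1, x2, z2)
```

Both updates read the values from step k, so the order in which the two lines are evaluated inside the function does not matter.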

It is curious that we need the sequence $y_k$ satisfying (3.5) to make the convergence proof work in discrete time. This $y_k$ differs from $x_k$ by a smaller time scale $\epsilon^{\frac{1}{p-1}} < \epsilon^{\frac{1}{p}} = \delta$ (3.18), so as $\delta \to 0$, $x_k$ and $y_k$ have the same continuous time limit $X_t = Y_t$. We also note that inequality (3.5) is the only place where the $(p-1)$-st order smoothness of $f$ is needed (by $G_p$, which is used by $y_k$ (3.2a)). It would be interesting to see whether it is possible to replace $G_p$ in (3.2a) by another algorithm that guarantees the same inequality (3.5) under a weaker assumption.

4.3 Interpretation as Euler-Lagrange equation

Nesterov flow (4.1) looks similar to the second order damped harmonic oscillator equation from physics, but with a subtle difference. Recall that in classical mechanics we typically model friction as a velocity-dependent force. The equation of motion is then a second order equation involving both time derivatives of $X_t$, e.g., $\ddot{X}_t + 2\zeta \dot{X}_t + X_t = 0$ for the damped harmonic oscillator with “damping ratio” $\zeta > 0$; whereas the ideal (frictionless) harmonic oscillator $\ddot{X}_t = -X_t$ involves no velocity term. The presence of $2\zeta \dot{X}_t$ in the damped harmonic oscillator changes the nature of the system, from conservative (conserves energy $\mathcal{E}_t = \frac{1}{2} \|\dot{X}_t\|^2 + \frac{1}{2} \|X_t\|^2$, so the ideal harmonic oscillator never stops) to dissipative (dissipates energy, the system stabilizes, and the curve $X_t$ converges).

Like the damped harmonic oscillator, Nesterov flow (4.1) is also a second order equation involving both time derivatives of $X_t$, although note that the velocity term in Nesterov flow has time-varying coefficient $\frac{p+1}{t}$. Nevertheless, we can still capture Nesterov flow under the same general framework of dissipative systems, but with “logarithmic damping” (instead of linear); see Section 5.

Concretely, we observe that Nesterov flow can be interpreted as the Euler-Lagrange equation:

$$\frac{d}{dt} \left\{ \frac{\partial \mathcal{L}}{\partial v} (X_t, \dot{X}_t, t) \right\} = \frac{\partial \mathcal{L}}{\partial x} (X_t, \dot{X}_t, t) \tag{4.10}$$

where $\mathcal{L}(x, v, t)$ is the (Nesterov) Lagrangian functional:

$$\mathcal{L}(x, v, t) = p\, t^{p-1} D_h\left( x + \frac{t}{p} v, x \right) - Cp\, t^{2p-1} f(x) \tag{4.11}$$

defined for any point $x \in \mathcal{X}$, tangent vector $v \in T_x \mathcal{X}$, and time $t \in \mathbb{R}$. In (4.10), $\frac{\partial \mathcal{L}}{\partial v} (X_t, \dot{X}_t, t)$ is the partial derivative of $\mathcal{L}(x, v, t)$ with respect to $v$ evaluated at $(x, v, t) = (X_t, \dot{X}_t, t)$, and similarly for $\frac{\partial \mathcal{L}}{\partial x} (X_t, \dot{X}_t, t)$. For the Nesterov Lagrangian (4.11), we can calculate these derivatives explicitly and verify that (4.10) above is indeed equivalent to Nesterov flow (4.1).

Recall that the Euler-Lagrange equation (4.10) is a necessary (and often sufficient) condition for $X_t$ to be a stationary point for the following variational problem, for any time $t_0 < t_1$ [4, Theorem 1]:

$$\min_X \int_{t_0}^{t_1} \mathcal{L}(X_t, \dot{X}_t, t)\, dt \tag{4.12}$$

where the minimization is performed over all continuously differentiable curves $X_t$ with fixed endpoints $X_{t_0} = x_0$, $X_{t_1} = x_1 \in \mathcal{X}$. The objective function in (4.12) is typically called the action $\mathcal{A}(X)$, and the problem (4.12) is known as the principle of least action, which has played an important role as an equivalent reformulation of much of classical physics.⁹
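Returning to the claim that the Euler-Lagrange equation of the Nesterov Lagrangian reproduces Nesterov flow, the derivatives can be computed as follows. This is a sketch, taking the Lagrangian in the form $\mathcal{L}(x, v, t) = p\, t^{p-1} D_h(x + \frac{t}{p} v, x) - Cp\, t^{2p-1} f(x)$, writing $Z_t = X_t + \frac{t}{p} \dot{X}_t$, and using $D_h(y, x) = h(y) - h(x) - \langle \nabla h(x), y - x \rangle$:

$$\begin{aligned}
\frac{\partial \mathcal{L}}{\partial v} (X_t, \dot{X}_t, t) &= t^p \big( \nabla h(Z_t) - \nabla h(X_t) \big) \\
\frac{\partial \mathcal{L}}{\partial x} (X_t, \dot{X}_t, t) &= p\, t^{p-1} \big( \nabla h(Z_t) - \nabla h(X_t) \big) - t^p \nabla^2 h(X_t) \dot{X}_t - Cp\, t^{2p-1} \nabla f(X_t) \\
\frac{d}{dt} \frac{\partial \mathcal{L}}{\partial v} (X_t, \dot{X}_t, t) &= p\, t^{p-1} \big( \nabla h(Z_t) - \nabla h(X_t) \big) + t^p \frac{d}{dt} \nabla h(Z_t) - t^p \nabla^2 h(X_t) \dot{X}_t
\end{aligned}$$

Equating the last two lines, the terms in $\nabla h(Z_t) - \nabla h(X_t)$ and $\nabla^2 h(X_t) \dot{X}_t$ cancel, leaving $t^p \frac{d}{dt} \nabla h(Z_t) = -Cp\, t^{2p-1} \nabla f(X_t)$, which is (4.2) after dividing by $t^p$.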

Potential external connections aside, the significance of this observation is in making us aware of the rich structure in the family of Nesterov flows. Indeed, the interpretation of Nesterov flow as an Euler-Lagrange equation gives us access to the techniques and results from (for example) the calculus of variations, which may not be applicable to the first order rescaled gradient flows.¹⁰ And we believe that the discrete time accelerated algorithm $Q_p$ (3.2), whose convergence proof at first glance looks like an “algebraic trick”, is actually exploiting this structure.

Furthermore, it turns out that the family of Nesterov Lagrangians (4.11) is particularly nice, and in fact can be extended to a larger class of Lagrangians that preserves many of the nice properties.

5 A Lagrangian view of acceleration 

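We introduce the family of Bregman Lagrangians:

$$\mathcal{L}_{\alpha,\beta,\gamma}(x, v, t) = e^{\gamma_t} \left( e^{2\alpha_t} D_h(x + e^{-\alpha_t} v, x) - e^{\beta_t} f(x) \right) \tag{5.1}$$

where $\alpha_t \in \mathbb{R}$ is the scale function, $\beta_t \in \mathbb{R}$ the weight function, and $\gamma_t \in \mathbb{R}$ the damping function. In the Euclidean case $h(x) = \frac{1}{2} \|x\|_2^2$ ($\ell_2$-norm), the Lagrangian (5.1) simplifies to (notice no $\alpha_t$):

$$\mathcal{L}_{\alpha,\beta,\gamma}(x, v, t) = e^{\gamma_t} \left( \frac{1}{2} \|v\|_2^2 - e^{\beta_t} f(x) \right). \tag{5.2}$$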

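Ideal scaling. In general, $\alpha_t, \beta_t, \gamma_t$ in (5.1) can be arbitrary, but we find there is an ideal scaling that is necessary for some results to hold (e.g., (5.3b) simplifies (5.4) to (5.5)):

$$\dot{\beta}_t = 2\dot{\alpha}_t + e^{\alpha_t} \tag{5.3a}$$

$$\dot{\gamma}_t = -\dot{\alpha}_t + e^{\alpha_t}. \tag{5.3b}$$

⁹Chief among them is Newton’s law of motion $\ddot{X}_t = -\nabla f(X_t)$, which comes from the ideal Lagrangian $\mathcal{L}_0(x, v, t) = \frac{1}{2} \|v\|^2 - f(x)$, where here $f$ is the “potential” function generating the “force” $F(x) = -\nabla f(x)$.

¹⁰Although, we can interpret rescaled gradient flow as the “massless limit” of a Lagrangian flow (Section 5.2).

Euler-Lagrange equation. For general functions $\alpha_t, \beta_t, \gamma_t$, the Euler-Lagrange equation (4.10) $\frac{d}{dt} \{ \frac{\partial \mathcal{L}}{\partial v} (X_t, \dot{X}_t, t) \} = \frac{\partial \mathcal{L}}{\partial x} (X_t, \dot{X}_t, t)$ for the Bregman Lagrangian $\mathcal{L} = \mathcal{L}_{\alpha,\beta,\gamma}$ (5.1) is given by:

$$\ddot{X}_t + (-\dot{\alpha}_t + e^{\alpha_t}) \dot{X}_t + e^{\beta_t} \left[ \nabla^2 h(X_t + e^{-\alpha_t} \dot{X}_t) \right]^{-1} \nabla f(X_t) + e^{\alpha_t} (\dot{\gamma}_t + \dot{\alpha}_t - e^{\alpha_t}) \left[ \nabla^2 h(X_t + e^{-\alpha_t} \dot{X}_t) \right]^{-1} \left( \nabla h(X_t + e^{-\alpha_t} \dot{X}_t) - \nabla h(X_t) \right) = 0. \tag{5.4}$$

If $\gamma_t$ satisfies the ideal scaling (5.3b), then the Euler-Lagrange equation (5.4) simplifies to:

$$\ddot{X}_t + \dot{\gamma}_t \dot{X}_t + e^{\beta_t} \left[ \nabla^2 h(X_t + e^{-\alpha_t} \dot{X}_t) \right]^{-1} \nabla f(X_t) = 0 \tag{5.5}$$

which we call Bregman flow. In the Euclidean case (5.2), it simplifies to $\ddot{X}_t + \dot{\gamma}_t \dot{X}_t + e^{\beta_t} \nabla f(X_t) = 0$.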

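Convergence rate via energy functional. Generalizing (4.3), we define the energy functional:

$$\mathcal{E}_t = e^{\beta_t - 2\alpha_t} (f(X_t) - f^*) + D_h\left( x^*, X_t + e^{-\alpha_t} \dot{X}_t \right). \tag{5.6}$$

Assuming the ideal scaling $\dot{\gamma}_t = -\dot{\alpha}_t + e^{\alpha_t}$ (5.3b), $\mathcal{E}_t$ has time derivative:

$$\dot{\mathcal{E}}_t = (\dot{\beta}_t - 2\dot{\alpha}_t) e^{\beta_t - 2\alpha_t} (f(X_t) - f^*) + e^{\beta_t - 2\alpha_t} \langle \nabla f(X_t), \dot{X}_t \rangle + e^{-\alpha_t} \left\langle \nabla^2 h(X_t + e^{-\alpha_t} \dot{X}_t) (\ddot{X}_t + \dot{\gamma}_t \dot{X}_t), X_t - x^* + e^{-\alpha_t} \dot{X}_t \right\rangle.$$

If $X_t$ satisfies the Euler-Lagrange equation (5.5), then $\dot{\mathcal{E}}_t$ simplifies to:

$$\dot{\mathcal{E}}_t = (\dot{\beta}_t - 2\dot{\alpha}_t) e^{\beta_t - 2\alpha_t} (f(X_t) - f^*) - e^{\beta_t - \alpha_t} \langle \nabla f(X_t), X_t - x^* \rangle.$$

Now invoking the convexity of $f$ (1.12), we can bound the expression above by:

$$\dot{\mathcal{E}}_t \le (\dot{\beta}_t - 2\dot{\alpha}_t - e^{\alpha_t}) e^{\beta_t - 2\alpha_t} (f(X_t) - f^*). \tag{5.7}$$

Thus, we see that if $\dot{\beta}_t \le 2\dot{\alpha}_t + e^{\alpha_t}$, then $\dot{\mathcal{E}}_t \le 0$, which implies $\mathcal{E}_t \le \mathcal{E}_0$ for all $t \ge 0$. Since $h$ is convex, $D_h(x^*, X_t + e^{-\alpha_t} \dot{X}_t) \ge 0$, therefore we get the bound $e^{\beta_t - 2\alpha_t} (f(X_t) - f^*) \le \mathcal{E}_0$. This gives a convergence rate $\rho_t = \beta_t - 2\alpha_t$, which we summarize in the following theorem. Note that for any $\alpha_t$, the optimal choice of $\beta_t$ in (5.8) below is given by the ideal scaling (5.3a), which yields rate $\rho_t = \beta_t - 2\alpha_t = \int_0^t e^{\alpha_s} ds$.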

Theorem 7. If $f$ and $h$ are convex and the ideal scaling (5.3b) holds, then for any $\alpha_t, \beta_t$ satisfying $\dot{\beta}_t \le 2\dot{\alpha}_t + e^{\alpha_t}$, the curve $X_t$ governed by the Euler-Lagrange equation (5.5) has convergence rate $\rho_t = \beta_t - 2\alpha_t$:

$$f(X_t) - f^* \le \mathcal{E}_0 e^{-\rho_t} = O(e^{-\rho_t}). \tag{5.8}$$

Example: Nesterov Lagrangian. As our motivating example, the Nesterov Lagrangian (4.11) is a special case of the Bregman Lagrangian (5.1) when we choose:

$$\alpha_t = -\log t + \log p \tag{5.9a}$$

$$\beta_t = (p-2) \log t + 2 \log p + \log C \tag{5.9b}$$

$$\gamma_t = (p+1) \log t - \log p \tag{5.9c}$$

$$\rho_t = p \log t \tag{5.9d}$$

which satisfy the ideal scaling (5.3)¹¹ for any $p > 0$. With these parameter choices, we can verify that the Bregman Lagrangian (5.1) reduces to the Nesterov Lagrangian (4.11), Bregman flow (5.5) reduces to Nesterov flow (4.1), and the convergence rate (5.8) recovers our earlier result (4.6).
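The verification mentioned above is easy to automate. The sketch below checks by finite differences that the Nesterov parameter choices satisfy the two ideal scaling conditions, read here as β̇ = 2α̇ + e^α and γ̇ = −α̇ + e^α (this reading of (5.3) is an assumption of the sketch), and that e^{β − 2α} equals the weight C t^p appearing in the energy (4.3). The helper names, the values of p and C, and the finite-difference step are all illustrative.

```python
import math


def nesterov_parameters(p, C):
    """Return the functions alpha_t, beta_t, gamma_t of the Nesterov Lagrangian."""
    def alpha(t):
        return -math.log(t) + math.log(p)

    def beta(t):
        return (p - 2) * math.log(t) + 2 * math.log(p) + math.log(C)

    def gamma(t):
        return (p + 1) * math.log(t) - math.log(p)

    return alpha, beta, gamma


def ideal_scaling_residuals(alpha, beta, gamma, t, step=1e-5):
    """Central-difference residuals of the two ideal scaling conditions at time t."""
    def ddt(g):
        return (g(t + step) - g(t - step)) / (2 * step)

    res_beta = ddt(beta) - (2 * ddt(alpha) + math.exp(alpha(t)))
    res_gamma = ddt(gamma) - (-ddt(alpha) + math.exp(alpha(t)))
    return res_beta, res_gamma


p, C = 3.0, 0.1
alpha, beta, gamma = nesterov_parameters(p, C)
residuals = [ideal_scaling_residuals(alpha, beta, gamma, t) for t in (0.5, 1.0, 4.0)]
# e^{beta_t - 2 alpha_t} should equal C * t^p
weight_gap = abs(math.exp(beta(2.0) - 2 * alpha(2.0)) - C * 2.0 ** p)
print(residuals, weight_gap)
```

The same residual function can be reused for other parameter families, for example constant α_t, by passing different callables.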

However, the Bregman Lagrangian family (5.1) is much more general, and we wish to study its more general properties (which we can then specialize to any subfamily, including Nesterov (4.11)). To expand the repertoire of Bregman Lagrangians, in Section 6 we study the family of constant $\alpha_t = \log c$ with rate $\rho_t = ct$, and show connections to the restart scheme proposed by Nesterov [9] to obtain linear convergence in discrete time under a uniform convexity assumption.

In the rest of this section, we discuss the interpretation of the Bregman Lagrangian as approximating the “true” momentum method (Hessian Lagrangian); we will also see how to interpret rescaled gradient flow as the massless limit of a (modified) Lagrangian flow.

5.1 Bregman Lagrangian as an approximation of Hessian Lagrangian 

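We define the family of Hessian Lagrangians:

$$\mathcal{L}_{\beta,\gamma}(x, v, t) = e^{\gamma_t} \left( \frac{1}{2} \|v\|^2_{\nabla^2 h(x)} - e^{\beta_t} f(x) \right) \tag{5.10}$$

where as before $\beta_t \in \mathbb{R}$ is the weight function and $\gamma_t \in \mathbb{R}$ the damping function, and $\|v\|^2_{\nabla^2 h(x)} = \langle \nabla^2 h(x) v, v \rangle$ is the squared Hessian norm. The Hessian Lagrangian (5.10) is a damped and weighted version of the ideal Lagrangian (with Hessian metric) $\mathcal{L}_0(x, v, t) = \frac{1}{2} \|v\|^2_{\nabla^2 h(x)} - f(x)$.

The Bregman divergence, being a first order approximation error, can be seen as approximating the squared Hessian norm:

$$e^{2\alpha} D_h(x + e^{-\alpha} v, x) \approx \frac{1}{2} \|v\|^2_{\nabla^2 h(x)} \tag{5.11}$$

for any $x \in \mathcal{X}$, $v \in T_x \mathcal{X}$, and $\alpha \in \mathbb{R}$ (such that $x + e^{-\alpha} v \in \mathcal{X}$). Therefore, we can interpret the Bregman Lagrangian (5.1) as approximating the Hessian Lagrangian (5.10): $\mathcal{L}_{\alpha,\beta,\gamma} \approx \mathcal{L}_{\beta,\gamma}$ for all $\alpha_t, \beta_t, \gamma_t$.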
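The approximation of the squared Hessian norm by a rescaled Bregman divergence can be seen numerically. The sketch below uses a one-dimensional example chosen purely for illustration, h(x) = x log x − x on x > 0, for which h''(x) = 1/x and D_h(y, x) = y log(y/x) − y + x, and compares e^{2α} D_h(x + e^{−α} v, x) with ½ v² h''(x) for increasing α. The function names and sample values are hypothetical.

```python
import math


def bregman_entropy(y, x):
    """Bregman divergence of h(x) = x*log(x) - x on x > 0."""
    return y * math.log(y / x) - y + x


def scaled_bregman(x, v, alpha):
    """Rescaled divergence e^{2 alpha} * D_h(x + e^{-alpha} v, x)."""
    return math.exp(2 * alpha) * bregman_entropy(x + math.exp(-alpha) * v, x)


def half_hessian_norm(x, v):
    """Half the squared Hessian norm, 0.5 * v * h''(x) * v, with h''(x) = 1/x."""
    return 0.5 * v * v / x


x, v = 1.0, 0.5
errors = [abs(scaled_bregman(x, v, a) - half_hessian_norm(x, v))
          for a in (2.0, 5.0, 8.0)]
print(errors)
```

The error shrinks roughly in proportion to e^{−α}, which matches the third order Taylor remainder of h.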

Note that in the Euclidean case (5.11) is an equality, so the Bregman and Hessian Lagrangians coincide. The Euler-Lagrange equation for the Hessian Lagrangian (5.10) is given by:

$$\frac{1}{2} \nabla^3 h(X_t) \dot{X}_t \dot{X}_t + \nabla^2 h(X_t) (\ddot{X}_t + \dot{\gamma}_t \dot{X}_t) + e^{\beta_t} \nabla f(X_t) = 0 \tag{5.12}$$

where the third order derivative $\nabla^3 h$ comes from being the derivative of the metric tensor $\nabla^2 h$. Notice that if we remove the first term from (5.12), then we recover the Euler-Lagrange equation (5.5) for the Bregman Lagrangian in the case $\alpha_t = \infty$ (which is the ideal case since by L’Hôpital’s rule, $\lim_{\alpha \to \infty} e^{2\alpha} D_h(x + e^{-\alpha} v, x) = \frac{1}{2} \|v\|^2_{\nabla^2 h(x)}$). Thus, Bregman flow (5.5) can be interpreted as an approximation to the Hessian flow (5.12) that removes the $\nabla^3 h$ term (and compensates by using $\nabla^2 h(X_t + e^{-\alpha_t} \dot{X}_t)$). Note that Bregman flow can be equivalently written as (4.7), which can then be discretized using mirror descent to yield an algorithm (3.2), which does not require $\nabla^3 h$ but only $\nabla h$. Therefore, this offers an interpretation of Nesterov’s acceleration technique as a clever approximate discretization of the Hessian Lagrangian, which reduces the complexity of the required computation from $\nabla^3 h$ in (5.12) down to $\nabla h$ in (3.2).

¹¹Note that (5.3) only determines $\alpha_t, \beta_t, \gamma_t$ up to constant terms, but we will see in Section 7.2 that (5.9) is the proper choice of constants from the perspective of time dilation.

However, it is actually still unclear why the Hessian Lagrangian is the right thing to approximate. For example, we do not have any convergence guarantee on the Hessian flow. The convergence rate $\rho_t = \int_0^t e^{\alpha_s} ds$ (5.8) for the Bregman Lagrangian tends to $\infty$ as $\alpha \to \infty$. This suggests that in the ideal limit $\alpha_t = \infty$, Hessian flow has instantaneous convergence ($f(X_t) - f^* \le 0$ for any $t > 0$), although note also that under the ideal scaling (5.3), $\beta, \gamma \to \infty$ as $\alpha \to \infty$ so this limit is not well defined.

Let us take a step back and notice that in approximating the Hessian norm by the Bregman divergence (5.11) we have introduced a scale variable $\alpha \in \mathbb{R}$, which provides the conversion factor between the scales of the point $x \in \mathcal{X}$ and the tangent vector $v \in T_x \mathcal{X}$. Ordinarily, we treat $v$ as operating at a small (infinitesimal) scale $\epsilon > 0$ where linear approximation holds, e.g., $f(x + \epsilon v) \approx f(x) + \epsilon \langle \nabla f(x), v \rangle$. But in practice, how should we choose $\alpha$? As noted, the ideal is $\alpha = \infty$, in which case the Bregman Lagrangian reduces to the Hessian Lagrangian. However, as soon as $\alpha = \log(1/\epsilon) < \infty$, there is the ideal scaling (5.3) in the Bregman Lagrangian that binds $\alpha, \beta, \gamma$ together, and renders the limit $\alpha \to \infty$ nonsensical. But in return, the ideal scaling gives us convergence rate $\rho_t = \int_0^t e^{\alpha_s} ds$, which is better for larger $\alpha_t$. For example, the Nesterov Lagrangian (4.11) uses logarithmic $\alpha_t = -\log t + \log p$, which yields p-sublinear rate $\rho_t = p \log t$. The exponential analog of Nesterov in Section 6.1 uses constant $\alpha_t = \log c$, which yields linear rate $\rho_t = ct$.

5.2 Rescaled gradient flow as massless limit of Lagrangian flow 

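We define the p-th power Lagrangian, $p > 0$:

$$\mathcal{L}(x, v, t) = e^{t/m} \left( \frac{m}{p} \|v\|^p - f(x) \right) \tag{5.13}$$

where $m > 0$ is the mass of our (fictitious) particle. In Section 5 we implicitly set $m = 1$, but here we are interested in the limiting behavior $m \to 0$, with $p$ fixed.

The Euler-Lagrange equation $\frac{d}{dt} \{ \frac{\partial \mathcal{L}}{\partial v} (X_t, \dot{X}_t, t) \} = \frac{\partial \mathcal{L}}{\partial x} (X_t, \dot{X}_t, t)$ for the Lagrangian (5.13) is:

$$\|\dot{X}_t\|^{p-2} (m \ddot{X}_t + \dot{X}_t) + m (p-2) \|\dot{X}_t\|^{p-4} \langle \ddot{X}_t, \dot{X}_t \rangle \dot{X}_t + \nabla f(X_t) = 0. \tag{5.14}$$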

So as $m \to 0$, the Euler-Lagrange equation (5.14) converges to the rescaled gradient flow (2.12):

$$\|\dot{X}_t\|^{p-2} \dot{X}_t + \nabla f(X_t) = 0.$$

This gives an interpretation of rescaled gradient flow, which is a first order equation, as the massless limit of the Lagrangian flow (5.14), which is a second order equation. However, notice that this massless limit also corresponds to infinite momentum: $e^{t/m} m \|\dot{X}_t\|^{p-2} \dot{X}_t \to \infty$ as $m \to 0$. This means as $m \to 0$, our particle (whose evolution is governed by (5.14)) becomes infinitely massive. In the limit $m = 0$ (rescaled gradient flow) there is no oscillation; the particle just rolls downhill (with infinite “friction”) and stops at the minimum $x^*$ as soon as the force $-\nabla f$ vanishes.

Note that this may run contrary to the idea that adding an acceleration/momentum term amounts to preventing oscillation, hence the faster convergence. What our interpretation suggests is the opposite: The first order rescaled gradient flow is the case of infinite momentum and no oscillation (rather than no momentum and big oscillation), and the effect of moving to the second order Lagrangian flow is to unwind the curve (to finite momentum) where it travels faster, but there is now oscillation. For example, the case $p = 2$ in (5.14) is the damped harmonic oscillator (when $f(x) = \frac{1}{2} \|x\|^2$). This point of view is also consistent with the work [11], who addressed this oscillation issue by a restart scheme (cf. Section 6.2). See also Appendix A.4 for an interpretation of natural gradient flow as the massless limit of Hessian Lagrangian flow.


6 Linear convergence rate via uniform convexity 

We study the exponential analog of Nesterov Lagrangian, which uses constant scale function and 
has linear convergence rate. We also show how to extend Nesterov’s restart scheme [9] in discrete 
time to get linear convergence rate when the objective function is uniformly convex. 


6.1 Exponential Nesterov Lagrangian family 

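We define the exp-Nesterov Lagrangian, for any $c > 0$:

$$\mathcal{L}(x, v, t) = c e^{ct} \left( D_h\left( x + \frac{1}{c} v, x \right) - e^{ct} f(x) \right). \tag{6.1}$$

This is the Bregman Lagrangian (5.1) with the following choices, which satisfy the ideal scaling (5.3):

$$\alpha_t = \log c \tag{6.2a}$$

$$\beta_t = ct + 2 \log c \tag{6.2b}$$

$$\gamma_t = ct - \log c \tag{6.2c}$$

$$\rho_t = ct. \tag{6.2d}$$

The Euler-Lagrange equation for the Lagrangian (6.1) is given by:

$$\ddot{X}_t + c \dot{X}_t + c^2 e^{ct} \left[ \nabla^2 h\left( X_t + \frac{1}{c} \dot{X}_t \right) \right]^{-1} \nabla f(X_t) = 0. \tag{6.3}$$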

By Theorem 7, the exp-Nesterov flow $X_t$ (6.3) has linear convergence rate $\rho_t = ct$ for convex $f$:

$$f(X_t) - f^* \le \left( f(X_0) - f^* + D_h\left( x^*, X_0 + \frac{1}{c} \dot{X}_0 \right) \right) e^{-ct}. \tag{6.4}$$

Thus, whereas Nesterov flow (4.1) has sublinear rate, exp-Nesterov flow (6.3) has linear rate.

Let us examine how to discretize exp-Nesterov flow. As in Section 4.2, we first write (6.3) as:

$$Z_t = X_t + \frac{1}{c} \dot{X}_t \tag{6.5a}$$

$$\frac{d}{dt} \nabla h(Z_t) = -c e^{ct} \nabla f(X_t). \tag{6.5b}$$

We discretize $X_t, Z_t$ into sequences $x_k, z_k$ with time step $\delta > 0$ as before, so $t = \delta k$. Using mirror descent to implement (6.5b), we obtain discrete time equations similar to (3.2c), (3.4):

$$z_k = \arg\min_z \left\{ c e^{c \delta k} \langle \nabla f(x_k), z \rangle + \frac{1}{\delta} D_h(z, z_{k-1}) \right\} \tag{6.6a}$$

$$x_{k+1} = c \delta z_k + (1 - c \delta) x_k. \tag{6.6b}$$

Note that the weight in (6.6b) is independent of time, but depends on $\delta$; and (6.6a) suggests the step size $\epsilon = \delta$ in the algorithm. If our analogy between continuous and discrete time convergence holds, then given the $O(e^{-ct})$ convergence rate (6.4) for $X_t$, we expect a matching $O(\frac{1}{\epsilon} e^{-ck})$ convergence rate in discrete time.
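For concreteness, one step of the exp-Nesterov discretization can be sketched as follows in the Euclidean case h(x) = ½‖x‖², an assumption made only for this example. Reading the mirror descent step with a 1/δ weight on the Bregman term (also an assumption of this sketch), it reduces to z_k = z_{k−1} − δ c e^{cδk} ∇f(x_k), followed by the convex combination x_{k+1} = cδ z_k + (1 − cδ) x_k. The function name and default values are hypothetical, and no convergence rate is claimed for these two updates alone.

```python
import math


def exp_nesterov_step(x, z_prev, k, grad_f, c=1.0, delta=0.1):
    """One step of the exp-Nesterov discretization with h(x) = 0.5*||x||^2.

    z_k     = z_{k-1} - delta * c * exp(c * delta * k) * grad_f(x_k)
    x_{k+1} = c * delta * z_k + (1 - c * delta) * x_k
    Requires 0 < c * delta <= 1 so the x-update is a convex combination.
    """
    if not 0.0 < c * delta <= 1.0:
        raise ValueError("need 0 < c * delta <= 1")
    g = grad_f(x)
    weight = delta * c * math.exp(c * delta * k)
    z = [zi - weight * gi for zi, gi in zip(z_prev, g)]
    x_next = [c * delta * zi + (1.0 - c * delta) * xi for xi, zi in zip(x, z)]
    return x_next, z


def identity_gradient(x):
    """Gradient of f(x) = 0.5 * ||x||^2."""
    return list(x)


x_next, z_new = exp_nesterov_step([1.0], [1.0], k=0, grad_f=identity_gradient)
print(x_next, z_new)
```

Unlike the polynomial case, the mixing weight cδ does not change with k, while the gradient weight grows exponentially in k.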

However, it is not clear how to get that with only (6.6). If we try to adapt the proof of Theorem 6 (with the ideal choice $p = \infty$), then we find that we need to introduce a sequence $y_k$ satisfying the following analog of Lemma 4, in order to conclude a convergence rate $O(\delta e^{-c \delta k})$:

$$\langle \nabla f(y_k), x_k - y_k \rangle \ge \delta e^{-c \delta k} \|\nabla f(y_k)\|_* \tag{6.7}$$

Notice that the rates above are consistent if we set $\epsilon = \delta = 1$. But the condition (6.7) means we need to make a constant improvement in each iteration from $x_k$ to $y_k$, although we are also free to be creative with how to construct $y_k$ and impose any assumption on $f$.

6.2 Restart scheme for uniformly convex objective function

We present a generalization of Nesterov’s restart scheme [9] that obtains a linear convergence rate when the objective function is both smooth and uniformly convex. This in a sense can be seen as a discrete time version of exp-Nesterov flow (6.5), which implements the exponential weighting and the improvement requirement (6.7) by running the accelerated gradient algorithm $Q_p$ (3.2) for some amount of time, within each iteration.

Linear convergence rate for $G_p$. Following [9, Section 5], we first show that the p-th gradient method $G_p$ has linear convergence rate if $f$ is uniformly convex. Concretely, consider the following $G_p$ with larger regularization (like (3.2a)), for $p \ge 2$:

$$x_{k+1} = \arg\min_x \left\{ f_{p-1}(x; x_k) + \frac{2}{\epsilon p} \|x - x_k\|^p \right\} \tag{6.8}$$


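where $\epsilon > 0$ here is a fixed step size. By Lemma 4, we know that if $f$ is $\frac{(p-1)!}{\epsilon}$-smooth of order $p-1$, then $\langle \nabla f(x_{k+1}), x_k - x_{k+1} \rangle \ge \frac{1}{4} \epsilon^{\frac{1}{p-1}} \|\nabla f(x_{k+1})\|_*^{\frac{p}{p-1}}$. If $f$ is $\sigma$-uniformly convex of order $p$, then we also have [9, Lemma 3] $\|\nabla f(x_{k+1})\|_*^{\frac{p}{p-1}} \ge \sigma^{\frac{1}{p-1}} (f(x_{k+1}) - f^*)$. Moreover, if $f$ is convex (1.12), then:

$$f(x_{k+1}) - f^* \le \frac{f(x_k) - f^*}{1 + \frac{1}{4} (\epsilon\sigma)^{\frac{1}{p-1}}} \le \frac{f(x_1) - f^*}{\left( 1 + \frac{1}{4} (\epsilon\sigma)^{\frac{1}{p-1}} \right)^k}. \tag{6.9}$$

If the (inverse) condition number $\kappa = \epsilon\sigma$ is small, then $1 + \frac{1}{4} \kappa^{\frac{1}{p-1}} \approx \exp(\frac{1}{4} \kappa^{\frac{1}{p-1}})$. And again by the smoothness of $f$, we have $f(x_1) \le \min_x \{ f_{p-1}(x; x_0) + \frac{2}{\epsilon p} \|x - x_0\|^p \} \le f^* + \frac{3}{\epsilon p} \|x_0 - x^*\|^p$. Therefore, (6.9) yields the convergence rate $\rho_k = ck$, $c = \frac{1}{4} \kappa^{\frac{1}{p-1}}$, for the p-th gradient method $G_p$ (6.8), generalizing the result of [9, (5.6)]:

$$f(x_{k+1}) - f^* \le \frac{3 \|x_0 - x^*\|^p}{\epsilon p \left( 1 + \frac{1}{4} \kappa^{\frac{1}{p-1}} \right)^k} \tag{6.10}$$

which matches the desired convergence rate $O(\frac{1}{\epsilon} e^{-ck})$ discussed in Section 6.1.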


Improved linear convergence rate for $Q_p$ with restart. We now show that a variant of the accelerated gradient method $Q_p$ attains a better linear convergence rate than $G_p$ (6.8). Specifically, consider the following restart scheme, generalizing [9, (5.7)]:

$$x_{(k+1)m} = \text{output } y_m \text{ of running } Q_p \text{ (3.2) for } m \text{ iterations with input } x_0 = x_{km} \tag{6.11}$$

where $m = 24p / \kappa^{\frac{1}{p}}$, and $\kappa = \epsilon\sigma$ as before is the inverse condition number of $f$. Here we assume we replace the Bregman divergence in the z-update (3.2b) by $d_p(z) = \frac{1}{p} \|z - x_0\|^p$, rescaled by its uniform convexity constant $2^{-p+2}$ (3.1). The proof of Theorem 6 still holds in this case, and for concreteness we choose $C = (4p)^{-p}$.

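Then, since $f$ is $\sigma$-uniformly convex of order $p$ (1.15), and by the bound (3.11) from Theorem 6:

$$\frac{\sigma}{p} \|x_{(k+1)m} - x^*\|^p \le f(x_{(k+1)m}) - f^* \le \frac{(4p)^p\, 2^{p-2}}{p} \cdot \frac{\|x_{km} - x^*\|^p}{\epsilon\, m^{(p)}} \le \frac{\sigma}{p e} \|x_{km} - x^*\|^p \tag{6.12}$$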


where the last inequality follows from our choice of $m$. Iterating (6.12) and rescaling the index $k \mapsto k/m$, we obtain $\|x_k - x^*\|^p \le e^{-k/m}\|x_0 - x^*\|^p$. To convert this into a bound on the function value, we use the smoothness of $f$. Let $y_k$ be the output of $\mathcal{G}_p$ (6.8) with input $x_k$. As noted before, if $f$ is $\frac{(p-1)!}{\epsilon}$-smooth of order $p-1$, then $f(y_k) - f^* \le \frac{3}{\epsilon p}\|x_k - x^*\|^p$. Therefore, we conclude that:

/(») - S ° (t (S'13) 

1 _ , -.1 
which matches the convergence rate 0{^e “) as discussed in Section 6.1 with c = Note, 

1 

this linear rate pt = ckp has better dependence for small k = ea than (6.10), generalizing the 
conclusion of [9, (5.8)]. However, the link to continuous time is not as clear as that of the Nesterov 
family. 
7 Time dilation: Faster convergence by speeding up time


In this section we introduce the idea of time dilation, and show that a large family of Bregman 
Lagrangians (which include Nesterov and exp-Nesterov) can be interpreted as the result of speeding 
up any single curve. 

7.1 Time dilation 

We recall the argument in Section 1 that optimization in continuous time is easy because we can get arbitrarily fast convergence. The idea is that once we have a curve $X_t$ that converges at some rate $\rho(t)$, then we can speed it up to $Y_t = X_{\tau(t)}$ with improved convergence rate $\rho(\tau(t)) > \rho(t)$.

Here $\tau = \tau(t) \in \mathbb{R}$ is a time dilation: a smooth, strictly increasing (hence invertible) function defined on the time domain $\mathbb{R}$ (or a subset of it), whose inverse is also smooth. The set $\mathcal{T}$ of time dilations forms a group under function composition. The identity element is the identity time:

$$\tau_{\mathrm{id}}(t) = t \quad \forall\, t \in \mathbb{R} \qquad (7.1)$$

which is the default time dilation we use, so normally time flows at unit speed: $d\tau_{\mathrm{id}}/dt = 1$.

Traversing a curve $X_t$ at another time speed, $Y_t = X_{\tau(t)}$, is equivalent to replacing the default time dilation $\tau_{\mathrm{id}}$ by $\tau \in \mathcal{T}$, so now time flows at speed $d\tau/dt = \dot\tau(t)$. We say that $\tau$ is faster than $\tau_{\mathrm{id}}$ if $\dot\tau(t) > 1$, in which case shifting from $\tau_{\mathrm{id}}$ to $\tau$ amounts to speeding up time, and the convergence rate $\rho(\tau(t))$ of $Y_t$ is larger than the original rate $\rho(t)$ of $X_t$ (if $\rho(t)$ is increasing).$^{12}$

However, how meaningful is this idea? Recall, our main interest is in understanding the parallel 
behavior between continuous and discrete time optimization, so even if we have a fast rate in 
continuous time, it may not be of interest to us if we don’t know how to implement it in discrete 
time (with matching convergence rate). 

Recall also from Section 1 the example of the gradient flow $\dot X_t = -\nabla f(X_t)$, which has convergence rate $\rho_t = \log t$. The sped-up version $Y_t = X_{\tau(t)}$ satisfies $\dot Y_t = -\dot\tau(t)\nabla f(Y_t)$, but it is not a gradient flow anymore (not of the form $\dot Y_t = -\nabla \tilde f(Y_t)$ for some function $\tilde f$ that does not explicitly depend on time). Similarly, speeding up rescaled gradient flow (2.10) yields a curve that is not in the rescaled gradient flow family. While these properties inhibit our understanding of how these curves relate to each other in continuous time, it turns out Bregman Lagrangian flows (5.5) have nice properties under time dilation, which we explore next.
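The identity $\dot Y_t = -\dot\tau(t)\nabla f(Y_t)$ for a sped-up gradient flow can be checked numerically on a toy case. The sketch below is only an illustration under assumed choices that do not come from the text: the objective $f(x) = x^2/2$ (whose gradient flow is $X_t = x_0 e^{-t}$) and the time dilation $\tau(t) = t^2$.

```python
import math

def X(t, x0=1.0):
    # Gradient flow for the assumed objective f(x) = x^2 / 2: dX/dt = -X.
    return x0 * math.exp(-t)

def tau(t):
    # Assumed time dilation tau(t) = t^2 (illustrative choice).
    return t * t

def dtau(t):
    # Derivative of the assumed time dilation.
    return 2.0 * t

def Y(t):
    # Sped-up curve Y_t = X_{tau(t)}.
    return X(tau(t))

def grad_f(x):
    # Gradient of f(x) = x^2 / 2.
    return x

def residual(t, h=1e-6):
    # Central finite difference for dY/dt, compared against -dtau(t) * grad f(Y_t).
    dY = (Y(t + h) - Y(t - h)) / (2.0 * h)
    return dY + dtau(t) * grad_f(Y(t))
```

The residual vanishes up to finite-difference error, while the time-dependent factor `dtau(t)` in front of the gradient is what prevents the sped-up curve from being an autonomous gradient flow.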

7.2 Bregman Lagrangian family under time dilation 

We show that the Bregman Lagrangian family is closed under the action of the time dilation group $\mathcal{T}$. Note, in general it holds that speeding up a Lagrangian curve (i.e., an Euler-Lagrange curve $X_t$ for a Lagrangian $\mathcal{L}$) results in another Lagrangian curve, so the space of general Lagrangian curves is closed under time dilation. Moreover, the Bregman Lagrangians (5.1) form a special subfamily of the Lagrangian space, and we can characterize precisely how they transform under time dilation.

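$^{12}$But for convenience, we refer to $Y_t = X_{\tau(t)}$ as the sped-up version of $X_t$ regardless of whether $\tau$ is faster than $\tau_{\mathrm{id}}$.

We begin by noting that the sped-up curve $Y_t = X_{\tau(t)}$ has time derivatives:

$$\dot Y_t = \dot\tau(t)\,\dot X_{\tau(t)} \qquad (7.2a)$$

$$\ddot Y_t = \ddot\tau(t)\,\dot X_{\tau(t)} + \dot\tau(t)^2\,\ddot X_{\tau(t)}. \qquad (7.2b)$$

Thus, if $X_t$ satisfies the Euler-Lagrange equation (5.5) for a Bregman Lagrangian $\mathcal{L} = \mathcal{L}_{\alpha,\beta,\gamma}$ (5.1), then $Y_t = X_{\tau(t)}$ satisfies the Euler-Lagrange equation for the modified Lagrangian $\mathcal{L}^{(\tau)} = \mathcal{L}_{\alpha^{(\tau)},\beta^{(\tau)},\gamma^{(\tau)}}$, where the parameters $\alpha, \beta, \gamma$ are transformed to

$$\alpha^{(\tau)}_t = \alpha_{\tau(t)} + \log\dot\tau(t) \qquad (7.3a)$$

$$\beta^{(\tau)}_t = \beta_{\tau(t)} + 2\log\dot\tau(t) \qquad (7.3b)$$

$$\gamma^{(\tau)}_t = \gamma_{\tau(t)} - \log\dot\tau(t). \qquad (7.3c)$$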

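This means each $\tau \in \mathcal{T}$ induces a map $\mathcal{L} \mapsto \mathcal{L}^{(\tau)}$ on the space of general Bregman Lagrangians:

$$\mathscr{L} = \{\mathcal{L}_{\alpha,\beta,\gamma} : \alpha_t, \beta_t, \gamma_t \in \mathbb{R}\}. \qquad (7.4)$$

Furthermore, by the chain rule, we see that the transformation (7.3) satisfies the composition property $(\alpha^{(\tau)})^{(\theta)} = \alpha^{(\tau\circ\theta)}$ for all $\tau, \theta \in \mathcal{T}$, and similarly for $\beta, \gamma$ (that is, speeding up by $\tau$ and then $\theta$ is equivalent to speeding up once by $\tau \circ \theta$). This lifts to the Lagrangian level:

$$(\mathcal{L}^{(\tau)})^{(\theta)} = \mathcal{L}^{(\tau\circ\theta)}.$$

Formally, this means the mapping $\mathcal{L} \mapsto \mathcal{L}^{(\tau)}$ is a (right) group action of $\mathcal{T}$ on $\mathscr{L}$, namely, a group homomorphism from $\mathcal{T}$ to the permutation group of $\mathscr{L}$.

This conclusion extends to the Hessian Lagrangian, which is the $\alpha \to \infty$ limit of the Bregman Lagrangian; thus, $\mathcal{T}$ also acts on the space of Hessian Lagrangians with the same transformation rules (7.3b), (7.3c).$^{13}$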
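The composition property of the scale-function transformation can be verified numerically. The following is a minimal sketch, assuming the rule $\alpha^{(\tau)}_t = \alpha_{\tau(t)} + \log\dot\tau(t)$; the particular scale function and the two time dilations ($t^3$ and $e^t$) are arbitrary illustrative choices, not taken from the text.

```python
import math

def dilate(alpha, tau, dtau):
    # Return the transformed scale function t -> alpha(tau(t)) + log(dtau(t)).
    return lambda t: alpha(tau(t)) + math.log(dtau(t))

# Illustrative scale function and time dilations (assumptions for this sketch).
alpha = lambda t: math.log(2.0) - math.log(t)
tau = lambda t: t ** 3
dtau = lambda t: 3.0 * t ** 2
theta = lambda t: math.exp(t)
dtheta = lambda t: math.exp(t)

# Speed up by tau, then by theta.
two_steps = dilate(dilate(alpha, tau, dtau), theta, dtheta)

# Speed up once by the composition tau o theta; its derivative follows from the chain rule.
comp = lambda t: tau(theta(t))
dcomp = lambda t: dtau(theta(t)) * dtheta(t)
one_step = dilate(alpha, comp, dcomp)
```

Evaluating `two_steps` and `one_step` at any positive time gives the same value up to rounding, which is the chain-rule argument in numerical form.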


7.3 The orbit of ideal Bregman Lagrangians 

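The ideal scaling (5.3) defines a special “one-dimensional” subspace of ideal Bregman Lagrangians:

$$\mathscr{L}_0 = \Big\{ \mathcal{L}_{\alpha,\beta,\gamma} : \alpha \in \mathscr{A},\ \beta = 2\alpha + \int e^{\alpha},\ \gamma = -\alpha + \int e^{\alpha} \Big\} \subset \mathscr{L} \qquad (7.5)$$

where $\mathscr{A}$ is the space of all scale functions $\alpha_t \in \mathbb{R}$, and we use the shorthand $\int e^{\alpha}$ for the time integral $\int^t e^{\alpha_s}\,ds$.

Recall by Theorem 7, the Lagrangian curve $X_t$ of an ideal Lagrangian $\mathcal{L}_\alpha = \mathcal{L}_{\alpha,\beta,\gamma} \in \mathscr{L}_0$ has convergence rate $\rho_t = \int e^{\alpha}$.$^{14}$

We observe that the family of ideal Bregman Lagrangians is closed under the action of time dilation. Indeed, the ideal scaling (5.3) holds for $(\alpha, \beta, \gamma)$ if and only if it holds for $(\alpha^{(\tau)}, \beta^{(\tau)}, \gamma^{(\tau)})$.

Equivalently, $\mathcal{L}_\alpha \in \mathscr{L}_0$ if and only if $\mathcal{L}_\alpha^{(\tau)} \in \mathscr{L}_0$ for any $\tau \in \mathcal{T}$.

$^{13}$Note, we can also show that $\mathcal{T}$ acts on the family of $p$-th power Lagrangians $\mathcal{L}(x, v, t) = e^{\gamma_t}\big(\frac{1}{p}\|v\|^p - e^{\beta_t} f(x)\big)$ ((5.13) is the case $\gamma_t = t/m$ and $\beta_t = 0$), where now $\beta$ transforms as $\beta^{(\tau)}_t = \beta_{\tau(t)} + p\log\dot\tau(t)$ ((7.3b) is the case $p = 2$).

$^{14}$More generally, we can consider the halfplane defined by $\beta \le 2\alpha + \int e^{\alpha}$, for which Theorem 7 holds.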
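Conversely, if we start from any ideal Lagrangian $\mathcal{L} \in \mathscr{L}_0$ (say, the standard Nesterov Lagrangian $\mathcal{L}^*$, $p = 2$ in (4.11)), then we can reach any other ideal Lagrangian $\mathcal{L}_\alpha \in \mathscr{L}_0$ by choosing the time dilation function $\tau = e^{\frac{1}{2}\int e^{\alpha}}$.$^{15}$

Therefore, we conclude that the family of ideal Bregman Lagrangians (7.5) is an orbit under the action of the time dilation group $\mathcal{T}$. That is, we can interpret $\mathscr{L}_0$ as the result of speeding up any initial Lagrangian (for concreteness $\mathcal{L}^*$), over all possible time dilations:

$$\mathscr{L}_0 = \{ (\mathcal{L}^*)^{(\tau)} : \tau \in \mathcal{T} \}. \qquad (7.6)$$

Notice, the convergence rate transforms consistently: $\mathcal{L}^*$ has rate $\rho_t = 2\log t$, so when we speed it up (by $\tau = e^{\frac{1}{2}\int e^{\alpha}}$) to $\mathcal{L}_\alpha$, the rate transforms to $\rho_\tau = 2\log\tau = \int e^{\alpha}$, as expected. Equivalently, we can also note that as $\alpha \mapsto \alpha^{(\tau)} = (\alpha \circ \tau) + \log\dot\tau$, the rate $\int e^{\alpha} = \int e^{\alpha_t}\,dt$ transforms to $\int e^{\alpha^{(\tau)}_t}\,dt = \int e^{\alpha_{\tau(t)}}\dot\tau(t)\,dt = \int e^{\alpha_\tau}\,d\tau$, consistent with the idea that we simply replace time $t$ $(= \tau_{\mathrm{id}})$ by $\tau$.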

7.4 The orbit of Nesterov Lagrangians 

We define the subgroup of polynomial time dilations $\mathcal{T}_{\mathrm{pol}} \subset \mathcal{T}$:

$$\mathcal{T}_{\mathrm{pol}} = \{ \tau_p(t) = t^p : p > 0 \} \qquad (7.7)$$

which is isomorphic to the multiplicative group $\mathbb{R}_{>0}$. The subgroup inherits the action on $\mathscr{L}_0$, but the action now partitions $\mathscr{L}_0$ into (sub)orbits.

We observe that the family of Nesterov Lagrangians (4.11) forms an orbit under the action of $\mathcal{T}_{\mathrm{pol}}$. Indeed by (7.3a), if we start from $\alpha^*_t = -\log t + \log 2$ (for $\mathcal{L}^*$ (4.11)), then for any $p > 0$, the time dilation $\tau(t) = t^{p/2}$ sends us to $(\alpha^*)^{(\tau)}_t = -\log t + \log p$. Therefore, we can view the Nesterov flows (4.1) as the result of speeding up the Lagrangian flow of $\mathcal{L}^*$ ($p = 2$), or any starting curve.
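The action of a polynomial time dilation on the Nesterov scale function can be checked directly. The sketch below assumes the transformation rule $\alpha \mapsto \alpha_{\tau(t)} + \log\dot\tau(t)$ and takes the exponent of the dilation to be $p/2$; that exponent is worked out here from the rule rather than quoted, and the function names are hypothetical.

```python
import math

def alpha_star(t):
    # Scale function of the standard Nesterov Lagrangian (p = 2): -log t + log 2.
    return -math.log(t) + math.log(2.0)

def dilated_alpha(t, p):
    # Apply the rule alpha(tau(t)) + log(dtau(t)) with tau(t) = t^(p/2).
    q = p / 2.0
    tau = t ** q
    dtau = q * t ** (q - 1.0)
    return alpha_star(tau) + math.log(dtau)

def nesterov_alpha(t, p):
    # Target scale function -log t + log p of the order-p Nesterov Lagrangian.
    return -math.log(t) + math.log(p)
```

For any positive `t` and `p`, `dilated_alpha(t, p)` and `nesterov_alpha(t, p)` coincide up to rounding, since $-\frac{p}{2}\log t + \log 2 + \log\frac{p}{2} + (\frac{p}{2} - 1)\log t = -\log t + \log p$.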

Moreover, as we have seen in Section 3, this speedup can be implemented in discrete time with matching convergence rate as the accelerated gradient methods (3.2). Thus, we can interpret the family of accelerated gradient methods as being the result of “speeding up” the $p = 2$ method (accelerated gradient descent) in discrete time, which we achieve via passage to continuous time, and at the cost of a higher-order smoothness assumption on $f$.

7.5 The orbit of exp-Nesterov Lagrangians, isomorphic to Nesterov Lagrangians 

From the standard Nesterov Lagrangian $\mathcal{L}^*$ ($p = 2$ in (4.11)), we can use the time dilation $\tau = e^{\frac{c}{2}t}$ to reach the exp-Nesterov Lagrangian $\mathcal{L} = \mathcal{L}_c$ (6.1) with $\alpha = \log c$, which has linear rate $\rho_t = ct$, $c > 0$. Recall, from $\mathcal{L}^*$ we can generate the Nesterov Lagrangians (4.11) as an orbit of the action of $\mathcal{T}_{\mathrm{pol}}$. The dilation $\mathcal{L}^* \mapsto \mathcal{L}_c$ (via the exponential time dilation $\tau = e^{\frac{c}{2}t}$) generates an equivalent orbit starting from $\mathcal{L}_c$, which turns out to be the family of exp-Nesterov Lagrangians (6.1). Concretely, we define the subgroup of linear time dilations $\mathcal{T}_{\mathrm{lin}} \subset \mathcal{T}$:

$$\mathcal{T}_{\mathrm{lin}} = \{ \tau_c(t) = ct : c > 0 \} \qquad (7.8)$$

which is isomorphic to $\mathcal{T}_{\mathrm{pol}}$. The discussion above says that the family of exp-Nesterov Lagrangians is an orbit under the action of $\mathcal{T}_{\mathrm{lin}}$. This means, in a precise sense, the Nesterov and exp-Nesterov families are isomorphic. Moreover, we can speed up the Nesterov Lagrangian to get the exp-Nesterov Lagrangian via an exponential time dilation function $\tau$. Furthermore, the associativity of the group action means we can speed up time to go from Nesterov to exp-Nesterov and back, and the results will remain consistent.

$^{15}$Explicitly, if $\alpha^*_t = -\log t + \log 2$ is the scale function for $\mathcal{L}^*$, then we can check that $(\alpha^*)^{(\tau)}_t = \alpha^*_{\tau(t)} + \log\dot\tau(t) = \alpha_t$.
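As a worked check of the two dilations used in this subsection, one can apply the rule $\alpha \mapsto \alpha_{\tau(t)} + \log\dot\tau(t)$ by hand. The derivation below is a sketch assuming the scale function $\alpha^*_t = -\log t + \log 2$ for the standard Nesterov Lagrangian and the exponent $c/2$ in the exponential dilation; the first line shows that the exponential dilation lands on the constant scale function $\log c$, and the second shows that a linear dilation $\tau_b(t) = bt$ moves $\log c$ to $\log(bc)$, so it stays within the exp-Nesterov family.

```latex
\begin{align*}
(\alpha^*)^{(\tau)}_t
  &= -\log\tau(t) + \log 2 + \log\dot\tau(t)
   = -\tfrac{c}{2}t + \log 2 + \log\!\big(\tfrac{c}{2}\,e^{\frac{c}{2}t}\big)
   = \log c,
  && \tau(t) = e^{\frac{c}{2}t}, \\
(\log c)^{(\tau_b)}_t
  &= \log c + \log\dot\tau_b(t)
   = \log c + \log b
   = \log(bc),
  && \tau_b(t) = bt .
\end{align*}
```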

We can summarize our discussion by saying that all triangles in the following diagram commute (and we can reverse any arrow by replacing $\tau$ with $\tau^{-1}$):

[Diagram not recovered. Surviving node label: exp-Nesterov, $b > 0$, $\alpha_t = \log b$, $\rho_t = bt$.]
A Natural gradient descent and mirror descent

We review the equivalence between natural gradient descent and mirror descent. 

A.1 Natural gradient descent and natural gradient flow

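Natural gradient descent. We can interpret natural gradient descent as the solution to a modified optimization problem, where we now measure the norm of the displacement using the Hessian metric:

$$x_{k+1} = x_k + v_k$$

$$v_k = \arg\min_v \Big\{ f(x_k) + \langle \nabla f(x_k), v \rangle + \frac{1}{2\epsilon}\|v\|^2_{h(x_k)} \Big\}. \qquad (A.1a)$$

This means at each point $x$ in $\mathcal{X}$ we have a local inner product and norm:

$$\langle u, v \rangle_{h(x)} = \langle u, \nabla^2 h(x)\, v \rangle, \qquad \|v\|^2_{h(x)} = \langle v, v \rangle_{h(x)} = \langle v, \nabla^2 h(x)\, v \rangle. \qquad (A.2)$$

Natural gradient flow. The continuous time limit ($\epsilon \to 0$ with time scaling $t = \epsilon k$) of natural gradient descent (A.1), which we call natural gradient flow, is simply gradient flow on this space:

$$\dot X_t = -\nabla^2 h(X_t)^{-1}\nabla f(X_t) \qquad (A.3)$$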

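Convergence of natural gradient flow. Define the energy functional

$$\mathcal{E}_t = t\,(f(X_t) - f^*) + D_h(x^*, X_t)$$

where $x^* = \arg\min_x f(x)$ and $f^* = f(x^*)$. It has time derivative:

$$\dot{\mathcal{E}}_t = f(X_t) - f^* + t\,\langle \nabla f(X_t), \dot X_t \rangle - \Big\langle \frac{d}{dt}\nabla h(X_t),\, x^* - X_t \Big\rangle.$$

Therefore, if $X_t$ is governed by natural gradient flow (A.3), so that $\frac{d}{dt}\nabla h(X_t) = \nabla^2 h(X_t)\dot X_t = -\nabla f(X_t)$, then $\dot{\mathcal{E}}_t$ simplifies to

$$\dot{\mathcal{E}}_t = \big( f(X_t) - f^* + \langle \nabla f(X_t), x^* - X_t \rangle \big) + t\,\langle \nabla f(X_t), \dot X_t \rangle \le t\,\langle \nabla f(X_t), \dot X_t \rangle \le 0 \qquad (A.4)$$

using the convexity of $f$. The energy is therefore decreasing over time: $\mathcal{E}_t \le \mathcal{E}_0 = D_h(x^*, X_0)$, $t \ge 0$. Thus, we conclude that natural gradient flow (A.3) has convergence guarantee:

$$f(X_t) - f^* \le \frac{D_h(x^*, X_0)}{t} = O(1/t).$$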
A.2 Mirror descent and mirror flow

Mirror descent. The link between natural gradient descent and mirror descent is given by the Bregman divergence. The Bregman divergence, being a first order approximation error, has the property that it approximates the Hessian norm (A.2) without requiring the second order derivative:

$$D_h(y, x) \approx \frac{1}{2}\langle y - x, \nabla^2 h(x)(y - x) \rangle = \frac{1}{2}\|y - x\|^2_{h(x)}. \qquad (A.5)$$

If we replace the Hessian norm in the optimization problem (A.1a) defining natural gradient descent, then we obtain mirror descent:

$$x_{k+1} = x_k + v_k$$

$$v_k = \arg\min_v \Big\{ f(x_k) + \langle \nabla f(x_k), v \rangle + \frac{1}{\epsilon} D_h(x_k + v, x_k) \Big\}. \qquad (A.6)$$

Setting the derivative of (A.6) to zero, we can also write mirror descent explicitly as:

$$\nabla h(x_{k+1}) = \nabla h(x_k) - \epsilon \nabla f(x_k). \qquad (A.7)$$
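The explicit form $\nabla h(x_{k+1}) = \nabla h(x_k) - \epsilon\nabla f(x_k)$ is easy to run for a concrete mirror map. The sketch below assumes the entropy-type function $h(x) = \sum_i (x_i \log x_i - x_i)$ on the positive orthant, so that $\nabla h(x) = \log x$ and the update becomes multiplicative; the quadratic objective and all parameter values are illustrative assumptions, not taken from the text.

```python
import math

def mirror_descent_entropy(grad_f, x0, eps, steps):
    # Mirror descent with grad h(x) = log x:
    # log x_{k+1} = log x_k - eps * grad f(x_k), i.e. a multiplicative update.
    x = list(x0)
    for _ in range(steps):
        g = grad_f(x)
        x = [xi * math.exp(-eps * gi) for xi, gi in zip(x, g)]
    return x

# Assumed objective f(x) = 0.5 * ||x - a||^2 with a positive minimizer a.
a = [0.5, 2.0]

def grad_f(x):
    # Gradient of the assumed quadratic objective.
    return [xi - ai for xi, ai in zip(x, a)]

x_final = mirror_descent_entropy(grad_f, x0=[1.0, 1.0], eps=0.1, steps=2000)
```

Because the update acts on $\log x$, every iterate stays strictly positive without any projection, which is one practical reason to prefer the mirror form on constrained domains of this kind.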


Mirror flow. The continuous time limit ($\epsilon \to 0$ with time scaling $t = \epsilon k$) of mirror descent (A.7) is the mirror flow, which is the system:

$$Z_t = \nabla h(X_t) \qquad (A.8a)$$

$$\dot Z_t = -\nabla f(X_t) \qquad (A.8b)$$

Therefore, mirror flow (A.8) is equivalent to natural gradient flow (A.3):

$$\frac{d}{dt}\nabla h(X_t) = \dot Z_t = -\nabla f(X_t) \quad\Longleftrightarrow\quad \dot X_t = -\nabla^2 h(X_t)^{-1}\nabla f(X_t). \qquad (A.9)$$

Furthermore, mirror flow still has the same $O(1/t)$ convergence in continuous time for any convex function $f$. While mirror descent and mirror flow have matching convergence rates, it is difficult to prove any convergence rate for natural gradient descent; all we can say is that it is a descent method if $f$ is smooth with respect to $\|\cdot\|_{h(x)}$.
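That the two discrete methods share a continuous-time limit can be seen from a single step: they agree to first order in the step size. The sketch below is an illustration under assumptions that are not from the text: a one-dimensional $h(x) = x\log x - x$ (so $\nabla h(x) = \log x$ and $\nabla^2 h(x) = 1/x$), a natural gradient step of the form $x - \epsilon\nabla^2 h(x)^{-1}\nabla f(x)$, and arbitrary values for the point and the gradient.

```python
import math

def natural_step(x, g, eps):
    # Natural gradient step with Hessian metric hess h(x) = 1/x, so the inverse Hessian is x.
    return x - eps * x * g

def mirror_step(x, g, eps):
    # Mirror descent step with grad h(x) = log x: log x_new = log x - eps * g.
    return x * math.exp(-eps * g)

def gap(eps, x=1.5, g=0.7):
    # Distance between the two one-step updates from the same point and gradient.
    return abs(natural_step(x, g, eps) - mirror_step(x, g, eps))
```

Shrinking `eps` by a factor of 10 shrinks `gap` by roughly a factor of 100, which is the second-order discrepancy one expects when the two updates differ only beyond the linear term in the step size.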

A.3 Natural gradient flow and mirror flow equivalence 

In (A.9) we showed an “algebraic trick” that shows how the same differential equation can be written in two different ways, demonstrating the equivalence between mirror flow and (natural) gradient flow. Formally, we can understand this trick as the manifestation of the following property: mirror flow is the pushforward of natural gradient flow under the mapping $\Phi = \nabla h$. In particular, mirror flow is also a gradient flow in the “dual manifold” $\mathcal{Z} = \nabla h(\mathcal{X})$. To illustrate this property, we show how the gradient flow changes when we transform the space.
Mapping the space. Suppose we map $\mathcal{X}$ to $\mathcal{Z} = \Phi(\mathcal{X})$ by a bijective smooth map:

$$\Phi : \mathcal{X} \to \mathcal{Z}$$

with inverse map (also smooth) $\Psi = \Phi^{-1} : \mathcal{Z} \to \mathcal{X}$. The objective function $f : \mathcal{X} \to \mathbb{R}$ transforms to a new objective function $\tilde f : \mathcal{Z} \to \mathbb{R}$ given by:

$$\tilde f = f \circ \Psi$$

so that if $z = \Phi(x)$:

$$\tilde f(z) = \tilde f(\Phi(x)) = f(x). \qquad (A.10)$$

However, note that $\tilde f$ is not necessarily convex (in $z$), even if $f$ is (in $x$).


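How the gradient changes. We will use the following notation:

$$\partial_x f(x_0, z_0) = \frac{\partial f(x, z)}{\partial x}\bigg|_{(x,z) = (x_0, z_0)}.$$

The new objective function $\tilde f$ now has gradient at $z_0$ (in the new space $\mathcal{Z}$):

$$\nabla \tilde f(z_0) = \partial_z \tilde f(z_0) = \partial_z (f \circ \Psi)(z_0) = J_\Psi(z_0)\,\partial_x f(\Psi(z_0)) = J_\Psi(z_0)\,\nabla f(\Psi(z_0)) \qquad (A.11)$$

where $J_\Psi(z_0)$ is the Jacobian (partial derivatives) of $\Psi(z)$ at $z = z_0$, represented as a matrix:

$$(J_\Psi(z_0))_{ij} = \partial_i (\Psi(z_0))_j = \frac{\partial \Psi_j(z)}{\partial z_i}\bigg|_{z = z_0}. \qquad (A.12)$$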


How the metric changes. Suppose $\mathcal{X}$ has metric $g(x)$ at point $x$ (e.g., Hessian metric $g = \nabla^2 h$). Then we can obtain a corresponding metric in $\mathcal{Z} = \Phi(\mathcal{X})$ which is the pullback metric of $g$ under the inverse mapping $\Psi = \Phi^{-1}$:

$$\tilde g = \Psi^* g.$$

Explicitly, this means the inner product at the point $z_0 = \Phi(x_0)$ is given by:

$$\langle u, v \rangle_{z_0} = \langle J_\Psi(z_0) u, J_\Psi(z_0) v \rangle_{x_0} = \langle J_\Psi(z_0) u,\ g(x_0) J_\Psi(z_0) v \rangle.$$

That is, the metric $g(x_0)$ at $x_0 = \Psi(z_0)$ now becomes the metric $\tilde g(z_0)$ at $z_0$:

$$\tilde g(z_0) = (\Psi^* g)(z_0) = J_\Psi(z_0)^\top\, (g \circ \Psi)(z_0)\, J_\Psi(z_0). \qquad (A.13)$$
How the gradient flow changes. Suppose in the original space $\mathcal{X}$ we have gradient flow, which (with the general definition of metric $g$) is given by:

$$\dot X_t = -g(X_t)^{-1}\nabla f(X_t).$$

In the new space $\mathcal{Z} = \Phi(\mathcal{X})$ with the pushforward metric $\tilde g = \Psi^* g$ and the new objective function $\tilde f = f \circ \Psi$, the natural gradient flow equation (A.3) becomes:

$$\dot Z_t = -\tilde g(Z_t)^{-1}\nabla \tilde f(Z_t).$$

Plugging in (A.11) and (A.13) then gives us:

$$\dot Z_t = -\big[ J_\Psi(Z_t)^\top g(\Psi(Z_t)) J_\Psi(Z_t) \big]^{-1} J_\Psi(Z_t)\,\nabla f(\Psi(Z_t)) = -J_\Psi(Z_t)^{-1}\, g(\Psi(Z_t))^{-1}\, \nabla f(\Psi(Z_t)). \qquad (A.14)$$

Mirror map from Legendre duality. Now consider the Hessian metric again:

$$g = \nabla^2 h.$$

In this case, there is a very nice choice of $\Psi$, called the mirror map:

$$\Psi = \nabla h^*.$$

Here $h^* : \mathcal{X}^* \to \mathbb{R}$ is the Legendre dual function, defined on the space of linear functionals $\mathcal{X}^*$:

$$h^*(z) = \sup_x \big\{ \langle z, x \rangle - h(x) \big\}. \qquad (A.15)$$

The optimum in (A.15) is achieved by $x$ satisfying $z = \nabla h(x)$. So for all $x \in \mathcal{X}$, we have the relation:

$$h^*(\nabla h(x)) = \langle \nabla h(x), x \rangle - h(x). \qquad (A.16)$$

Similarly, since $(h^*)^* = h$, for all $z \in \mathcal{Z}$ we also have:

$$h(\nabla h^*(z)) = \langle \nabla h^*(z), z \rangle - h^*(z). \qquad (A.17)$$

Comparing (A.16) and (A.17), we conclude:

$$z = \nabla h(x) \quad\Longleftrightarrow\quad x = \nabla h^*(z)$$

which means $\nabla h^* = (\nabla h)^{-1}$, so for all $z \in \mathcal{Z}$:

$$\nabla h(\nabla h^*(z)) = z.$$

Differentiating the expression above (calculating the Jacobian with respect to $z$) gives us:

$$\nabla^2 h^*(z)\; \nabla^2 h(\nabla h^*(z)) = I.$$
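The Legendre duality relations above can be checked on a closed-form example. The sketch below assumes the one-dimensional function $h(x) = x\log x - x$ on $x > 0$, chosen here only for illustration; its dual is $h^*(z) = e^z$, so the mirror map is $\nabla h(x) = \log x$ with inverse $\nabla h^*(z) = e^z$.

```python
import math

def h(x):
    # Assumed example: h(x) = x log x - x on x > 0.
    return x * math.log(x) - x

def grad_h(x):
    # Mirror map: grad h(x) = log x.
    return math.log(x)

def hess_h(x):
    # Hessian of h.
    return 1.0 / x

def h_star(z):
    # Legendre dual: sup_x { z x - h(x) } is attained at x = e^z and equals e^z.
    return math.exp(z)

def grad_h_star(z):
    # Inverse mirror map: grad h*(z) = e^z.
    return math.exp(z)

def hess_h_star(z):
    # Hessian of the dual function.
    return math.exp(z)
```

With these definitions, `grad_h(grad_h_star(z))` returns `z`, and the product `hess_h_star(z) * hess_h(grad_h_star(z))` equals one, which is the scalar version of the identity obtained by differentiating $\nabla h(\nabla h^*(z)) = z$.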
So with the Hessian metric $g = \nabla^2 h$ and the mirror map $\Psi = \nabla h^*$ ($\Phi = \nabla h$), we have:

$$J_\Psi(Z_t) = \partial_z \nabla h^*(Z_t) = \nabla^2 h^*(Z_t)$$

$$g(\Psi(Z_t)) = \nabla^2 h(\nabla h^*(Z_t)) = (\nabla^2 h^*(Z_t))^{-1}.$$

Notice how the choice of the mirror map makes $J_\Psi(Z_t)$ and $g(\Psi(Z_t))$ cancel each other. Therefore, the pushforward of the natural gradient flow (A.14) in $\mathcal{Z}$ is indeed the mirror flow (A.8):

$$\dot Z_t = -\nabla f(X_t).$$

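A.4 Natural gradient flow as massless limit of Hessian Lagrangian flow

We consider the damped Lagrangian using the Hessian metric:

$$\mathcal{L}(X_t, \dot X_t, t) = e^{\gamma_t}\Big( \frac{m}{2}\|\dot X_t\|^2_{h(X_t)} - f(X_t) \Big). \qquad (A.18)$$

The Euler-Lagrange equation corresponding to (A.18) is:

$$\frac{m}{2}\nabla^3 h(X_t)\dot X_t \dot X_t + \nabla^2 h(X_t)\big( m\ddot X_t + m\dot\gamma_t \dot X_t \big) + \nabla f(X_t) = 0. \qquad (A.19)$$

When $\gamma_t = t/m$, then in the limit $m \to 0$, the equation (A.19) converges to the first order equation

$$\nabla^2 h(X_t)\dot X_t + \nabla f(X_t) = 0$$

which is equivalent to the natural gradient flow (A.3).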
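For readers who want the intermediate step, the Euler-Lagrange equation can be derived as follows. This is a sketch that assumes the damped Lagrangian has the form $\mathcal{L}(x, v, t) = e^{\gamma_t}\big(\frac{m}{2}\langle v, \nabla^2 h(x) v\rangle - f(x)\big)$, with $\nabla^3 h(x)\,v\,v$ denoting the third-derivative tensor contracted twice with $v$.

```latex
\begin{align*}
\frac{\partial \mathcal{L}}{\partial v}
  &= m\, e^{\gamma_t}\, \nabla^2 h(x)\, v,
\qquad
\frac{\partial \mathcal{L}}{\partial x}
   = e^{\gamma_t}\Big( \tfrac{m}{2}\, \nabla^3 h(x)\, v\, v - \nabla f(x) \Big), \\
\frac{d}{dt}\frac{\partial \mathcal{L}}{\partial v}(X_t, \dot X_t, t)
  &= e^{\gamma_t}\Big( m\dot\gamma_t\, \nabla^2 h(X_t)\dot X_t
     + m\, \nabla^3 h(X_t)\dot X_t \dot X_t
     + m\, \nabla^2 h(X_t)\ddot X_t \Big).
\end{align*}
```

Setting $\frac{d}{dt}\frac{\partial\mathcal{L}}{\partial v} = \frac{\partial\mathcal{L}}{\partial x}$ and dividing by $e^{\gamma_t}$ leaves $\frac{m}{2}\nabla^3 h(X_t)\dot X_t\dot X_t + \nabla^2 h(X_t)(m\ddot X_t + m\dot\gamma_t\dot X_t) + \nabla f(X_t) = 0$. With $\gamma_t = t/m$ the damping coefficient $m\dot\gamma_t$ equals 1, so only the first-order terms survive as $m \to 0$.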

B Deferred proofs 

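B.1 Proof of Lemma 1

We prove Lemma 1, following the technique of [9, Theorem 1]. Since $f$ is $\frac{(p-1)!}{\epsilon}$-smooth of order $p-1$, and we define $x_{k+1}$ by (2.7), we have the following bound:

$$f(x_{k+1}) \le \min_x \Big\{ f(x) + \frac{2}{\epsilon}\cdot\frac{1}{p}\|x - x_k\|^p \Big\}. \qquad (B.1)$$

Moreover, choosing $x = x^*$ in (B.1) gives us the bound:

$$f(x_{k+1}) - f^* \le \frac{2}{\epsilon}\cdot\frac{1}{p}\|x^* - x_k\|^p.$$

For any $\lambda \in (0,1)$, consider the intermediate point:

$$x_\lambda = x^* + (1-\lambda)(x_k - x^*) = \lambda x^* + (1-\lambda)x_k.$$

Then from the convexity of $f$ (1.12), we have the bound $f(x_\lambda) \le \lambda f^* + (1-\lambda) f(x_k)$. Plugging this into (B.1) with the choice $x = x_\lambda$, and using $\|x_\lambda - x_k\| = \lambda\|x^* - x_k\| \le \lambda R$, we find:

$$f(x_{k+1}) \le f(x_\lambda) + \frac{2}{\epsilon}\cdot\frac{1}{p}\|x_\lambda - x_k\|^p \le \lambda f^* + (1-\lambda) f(x_k) + \frac{2}{\epsilon}\cdot\frac{1}{p} R^p \lambda^p.$$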
Then with our notation $\delta_k = f(x_k) - f^*$, we can write the last inequality above more concisely as:

$$\delta_{k+1} \le (1-\lambda)\delta_k + \frac{2}{\epsilon}\cdot\frac{1}{p} R^p \lambda^p.$$

Minimizing the right hand side with respect to $\lambda$ yields the optimal bound:

$$\delta_{k+1} \le \delta_k - \frac{p-1}{p}\left( \frac{\epsilon\,\delta_k^p}{2R^p} \right)^{\frac{1}{p-1}}. \qquad (B.2)$$


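Now consider the energy functional $e_k = \delta_k^{-\frac{1}{p-1}}$; we have:

$$e_{k+1} - e_k = \frac{1}{\delta_{k+1}^{\frac{1}{p-1}}} - \frac{1}{\delta_k^{\frac{1}{p-1}}} = \frac{\delta_k^{\frac{1}{p-1}} - \delta_{k+1}^{\frac{1}{p-1}}}{\delta_{k+1}^{\frac{1}{p-1}}\,\delta_k^{\frac{1}{p-1}}} = \frac{\delta_k - \delta_{k+1}}{\delta_{k+1}^{\frac{1}{p-1}}\,\delta_k^{\frac{1}{p-1}} \sum_{i=0}^{p-2} \delta_k^{\frac{i}{p-1}}\,\delta_{k+1}^{\frac{p-2-i}{p-1}}}. \qquad (B.3)$$

We know that $\mathcal{G}_p$ is a descent method, so $e_{k+1} \ge e_k$. The summation in the denominator of (B.3) can be bounded above by $(p-1)\,\delta_k^{\frac{p-2}{p-1}}$, and we can lower bound $\delta_k - \delta_{k+1}$ using (B.2). Therefore:

$$e_{k+1} - e_k \ge \frac{p-1}{p}\left( \frac{\epsilon\,\delta_k^p}{2R^p} \right)^{\frac{1}{p-1}} \cdot \frac{1}{\delta_{k+1}^{\frac{1}{p-1}}\,\delta_k^{\frac{1}{p-1}}} \cdot \frac{1}{(p-1)\,\delta_k^{\frac{p-2}{p-1}}} \ge \frac{1}{p}\left( \frac{\epsilon}{2R^p} \right)^{\frac{1}{p-1}}. \qquad (B.4)$$

Summing (B.4) and telescoping the terms, we get:

$$e_k \ge e_k - e_0 \ge \frac{k}{p}\left( \frac{\epsilon}{2R^p} \right)^{\frac{1}{p-1}}$$

which gives us the conclusion.

□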
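The telescoping argument can be sanity-checked numerically by iterating the one-step recursion with equality and comparing against the resulting rate. The sketch below assumes the recursion in the form $\delta_{k+1} = \delta_k - \frac{p-1}{p}\big(\frac{\epsilon\delta_k^p}{2R^p}\big)^{1/(p-1)}$ and the starting gap $\delta_0 = \frac{2R^p}{\epsilon p}$; the function names and parameter values are hypothetical.

```python
def worst_case_gaps(p, eps, R, steps):
    # Iterate the recursion with equality, starting from delta_0 = 2 R^p / (eps p).
    delta = 2.0 * R ** p / (eps * p)
    gaps = [delta]
    for _ in range(steps):
        decrease = (p - 1) / p * (eps * delta ** p / (2.0 * R ** p)) ** (1.0 / (p - 1))
        delta = delta - decrease
        gaps.append(delta)
    return gaps

def lemma_bound(p, eps, R, k):
    # Rearranging e_k >= (k / p) * (eps / (2 R^p))^(1/(p-1)) with delta_k = e_k^(-(p-1)).
    return (p / k) ** (p - 1) * 2.0 * R ** p / eps
```

For each order `p`, the simulated gaps stay positive and remain below `lemma_bound`, in line with a decay of order $1/(\epsilon k^{p-1})$.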


B.2 Proof of Theorem 3 

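Consider the energy functional (2.14):

$$E_t = (f(X_t) - f^*)^{-\frac{1}{p-1}}. \qquad (B.5)$$

Its time derivative is:

$$\dot E_t = -\frac{1}{p-1}\,\frac{\langle \nabla f(X_t), \dot X_t \rangle}{(f(X_t) - f^*)^{\frac{p}{p-1}}}.$$

If $X_t$ evolves following the rescaled gradient flow (2.12), then $\dot E_t$ simplifies to:

$$\dot E_t = \frac{1}{p-1}\left( \frac{\|\nabla f(X_t)\|_*}{f(X_t) - f^*} \right)^{\frac{p}{p-1}}. \qquad (B.6)$$

Notice that by the convexity of $f$, we have

$$0 \le f(X_t) - f^* \le \langle \nabla f(X_t), X_t - x^* \rangle \le \|\nabla f(X_t)\|_* \cdot \|X_t - x^*\|$$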
and therefore, from (B.6) we obtain a bound:

$$\dot E_t \ge \frac{1}{p-1}\,\frac{1}{\|X_t - x^*\|^{\frac{p}{p-1}}}.$$

Integrating, this gives a lower bound on $E_t$:

$$E_t \ge E_0 + \frac{1}{p-1}\int_0^t \frac{1}{\|X_\tau - x^*\|^{\frac{p}{p-1}}}\,d\tau. \qquad (B.7)$$

Furthermore, under the additional mild assumption that the level sets are bounded, so that $\|X_t - x^*\| \le R$, the bound (B.7) simplifies to:

$$E_t \ge E_0 + \frac{t}{(p-1)\,R^{\frac{p}{p-1}}}.$$

Now recalling the definition (B.5), this lower bound becomes an upper bound on the function values:

$$f(X_t) - f^* \le \left( E_0 + \frac{t}{(p-1)\,R^{\frac{p}{p-1}}} \right)^{-(p-1)}.$$

Finally, replacing $E_0 > 0$ by $0$ completes the proof. □


B.3 Proof of Lemma 4 

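We follow the proof of [9, Lemma 6]. Since $y_k$ solves the optimization problem (3.2a), it satisfies the optimality condition:

$$\sum_{i=1}^{p-1} \frac{1}{(i-1)!}\nabla^i f(x_k)(y_k - x_k)^{i-1} + \frac{2}{\epsilon}\|y_k - x_k\|^{p-2}(y_k - x_k) = 0. \qquad (B.8)$$

Furthermore, since $\nabla^{p-1} f$ is $\frac{(p-1)!}{\epsilon}$-Lipschitz, we have the following error bound on the $(p-2)$-nd order Taylor expansion of $\nabla f$:

$$\Big\| \nabla f(y_k) - \sum_{i=1}^{p-1} \frac{1}{(i-1)!}\nabla^i f(x_k)(y_k - x_k)^{i-1} \Big\|_* \le \frac{1}{\epsilon}\|y_k - x_k\|^{p-1}. \qquad (B.9)$$

Substituting (B.8) into the square of (B.9) and writing $r = \|y_k - x_k\|$, we obtain:

$$\frac{r^{2p-2}}{\epsilon^2} \ge \Big\| \nabla f(y_k) + \frac{2r^{p-2}}{\epsilon}(y_k - x_k) \Big\|_*^2.$$

Upon expanding the square and rearranging, we get the inequality:

$$\langle \nabla f(y_k), x_k - y_k \rangle \ge \frac{\epsilon}{4r^{p-2}}\|\nabla f(y_k)\|_*^2 + \frac{3}{4\epsilon} r^p. \qquad (B.10)$$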
Note that if $p = 2$, then the first term in (B.10) above already implies the desired bound (3.5). Now assume $p \ge 3$. The right hand side of (B.10) is a convex function of $r$, and it is minimized by $r^* = \big\{ \frac{p-2}{3p}\,\epsilon^2 \|\nabla f(y_k)\|_*^2 \big\}^{\frac{1}{2p-2}}$, yielding a lower bound of (B.10) that is now independent of $r$:

$$\langle \nabla f(y_k), x_k - y_k \rangle \ge \frac{1}{4}\left( \Big(\frac{3p}{p-2}\Big)^{\frac{p-2}{2p-2}} + 3\Big(\frac{p-2}{3p}\Big)^{\frac{p}{2p-2}} \right) \epsilon^{\frac{1}{p-1}} \|\nabla f(y_k)\|_*^{\frac{p}{p-1}} \ge \frac{1}{4}\,\epsilon^{\frac{1}{p-1}} \|\nabla f(y_k)\|_*^{\frac{p}{p-1}}$$

as desired.

□
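The minimization over $r$ in the last step can be checked numerically. The sketch below assumes the right-hand side has the form $\frac{\epsilon}{4r^{p-2}}A^2 + \frac{3}{4\epsilon}r^p$ with $A = \|\nabla f(y_k)\|_*$, and that the target bound is $\frac{1}{4}\epsilon^{1/(p-1)}A^{p/(p-1)}$; the function names and the parameter grid are hypothetical.

```python
def rhs(r, p, eps, A):
    # Assumed right-hand side as a function of r, with A the dual norm of the gradient.
    return eps * A ** 2 / (4.0 * r ** (p - 2)) + 3.0 * r ** p / (4.0 * eps)

def r_star(p, eps, A):
    # Stationary point of rhs in r (requires p >= 3).
    return ((p - 2) / (3.0 * p) * eps ** 2 * A ** 2) ** (1.0 / (2 * p - 2))

def target(p, eps, A):
    # Bound to be established: (1/4) * eps^(1/(p-1)) * A^(p/(p-1)).
    return 0.25 * eps ** (1.0 / (p - 1)) * A ** (p / (p - 1))
```

At `r_star` the function `rhs` is no larger than at nearby values of `r`, and its value there dominates `target`, which is the inequality needed to remove the dependence on $r$.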